How to scrape Rightmove listings with Python and augment the data with commuting time from TFL
Example and context for a Python script that scrapes a particular Rightmove listing and uses publicly available APIs to add commute time and postcode information.
The script in this post focuses on extracting the information from a particular Rightmove listing, calculating the time it would take to travel from the listing to your workplace (or any postcode!) using National Rail and TFL services, and adding more information about the listing's location using public postcode APIs.
My goal wasn't to do a big area / modelling analysis that would require scraping all the search results, but to collect the information from the listings I found interesting so I could run comparisons, do specific analyses, or compile the data and share it via Google Sheets.
The code is not clean, since it started with a different focus than CSV exports, but even in its current state it should serve as a starting point if your use case is similar.
A few notes and details on the script
Arguments
The script expects the URL to scrape, the size in square metres, and a location.
- sqm size: Not every listing has the size in the dedicated field, but most of the time it can be found in the floor plan or the description.
- location: The listing data and the API responses used to augment it include location names, but I found that a lot of the time I wanted to tag a listing with the name I use for the area.
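As a sketch, the three arguments map onto positional argparse arguments; the URL, size and location values below are made-up examples, parsed from a list instead of the real command line:

```python
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("rightmove_url", help="Rightmove Property URL")
parser.add_argument("property_size_sqm", help="Size in square metres", type=float)
parser.add_argument("manual_location", help="Your own name for the area")

# Equivalent to: python scraper.py <url> 72.5 Walthamstow
args = parser.parse_args(
    ["https://www.rightmove.co.uk/properties/123456789", "72.5", "Walthamstow"]
)
```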
Parsing
Data
The script takes advantage of a JSON formatted field in Rightmove listings which contains all the information displayed on the page.
The field's components are parsed into typed dictionaries and dataframes. This logic could be simplified now that the script focuses on exporting to CSV.
The components that are annoying to parse because they contain multiple values get their own dataframe and CSV export. They are easy to join to the main data dump when needed, but often I didn't use them for any analysis!
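The core extraction can be sketched in a few lines. The HTML string below is a toy stand-in for a real listing page, with a made-up property id; the full script first locates the script tag with BeautifulSoup before applying a regex like this one:

```python
import json
import re

# Toy stand-in for a Rightmove page (not real listing data).
html = '<script>  window.PAGE_MODEL = {"propertyData": {"id": "123"}}  </script>'

# Pull out the JSON object assigned to window.PAGE_MODEL.
match = re.search(r"window\.PAGE_MODEL = (\{.*\})", html)
page_model = json.loads(match.group(1)) if match else {}
```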
User Agent
The script has a list of valid user agents from which one is drawn at random every time a listing is scraped. This helps avoid getting blocked by Rightmove during intense house-hunting sessions.
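A minimal sketch of the rotation; the two agent strings here are just examples, swap in whichever current browser strings you prefer:

```python
import random

# Example agent strings only; keep these reasonably current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
]

def random_headers() -> dict:
    """Draw a fresh User-Agent for each scrape."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
```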
Commute Time
The Transport for London Unified API is an awesome free resource you can use to programmatically review and calculate all kinds of travel information around London.
In the script, we use the TFL API to get a reference for how long it would take to travel from the listing to any location in London. The API responses include National Rail services, so listings from outside the Big Smoke are fair game.
The call uses a single location as the arrival point (e.g. your workplace), travelling in the middle of the week at 8:30am.
Note that we don't add any complex logic beyond validating the response status. This means you should check for planned disruptions (e.g. strikes or engineering works) and change the parameters accordingly to avoid getting something useless.
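For reference, the journey call boils down to a single GET against the JourneyResults endpoint. This sketch only builds the URL (it sends nothing); the coordinates, arrival postcode, date and API key are all placeholders:

```python
from urllib.parse import urlencode

def build_tfl_journey_url(from_lat_lon: str, to_location: str,
                          date: str, time: str, api_key: str) -> str:
    """Assemble a TFL JourneyResults URL; does not perform the request."""
    base = f"https://api.tfl.gov.uk/Journey/JourneyResults/{from_lat_lon}/to/{to_location}"
    params = {
        "nationalSearch": "true",       # include National Rail services
        "date": date,                   # YYYYMMDD
        "time": time,                   # HHMM, e.g. "0830"
        "timeIs": "Departing",
        "journeyPreference": "LeastTime",
        "api_key": api_key,
    }
    return f"{base}?{urlencode(params)}"

# Placeholder origin, destination, date and key.
url = build_tfl_journey_url("51.5,-0.1", "SW1A1AA", "20240117", "0830", "demo-key")
```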
Postcode Info
Sometimes listings don't have full information about their location; this is common with New Build listings. The Royal Mail data is paid-for, but Postcodes.io is a free alternative that provides a lot of the same info. I'm using it to complement the data from the listing in case I need more identifiers to join with other stats, like crime data.
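The fields the script pulls out of a Postcodes.io reverse-geocode response can be sketched like this; the sample payload below is illustrative, not real API output:

```python
def extract_admin_fields(payload: dict) -> dict:
    """Pick the admin fields the script joins onto the property data."""
    first = payload["result"][0]
    return {
        "admin_district": first.get("admin_district"),
        "parish": first.get("parish"),
        "admin_county": first.get("admin_county"),
    }

# Illustrative response shape; a real call would be
# GET https://api.postcodes.io/postcodes?lon=<lon>&lat=<lat>
sample = {"result": [{"admin_district": "Camden", "parish": None, "admin_county": None}]}
info = extract_admin_fields(sample)
```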
from argparse import ArgumentParser
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
import json
import os
import pandas as pd
import random
import re
import requests
from typing import TypedDict
arg_parser = ArgumentParser()
arg_parser.add_argument("rightmove_url", help="Rightmove Property URL")
arg_parser.add_argument(
"property_size_sqm", help="Property Size in Square Meters", type=float
)
arg_parser.add_argument("manual_location", help="Property Location (Town, City)")
args = arg_parser.parse_args()
class PropertyData(TypedDict):
id: str
published: bool
archived: bool
description: str
propertyPhrase: str
auctionFeesDisclaimer: str
guidePriceDisclaimer: str
reservePriceDisclaimer: str
pageTitle: str
primaryPrice: str
displayPriceQualifier: str
pricePerSqft: str
address: str
outcode: str
incode: str
keyFeatures: list
images: list
floorplans: list
companyName: str
companyBranch: str
companyBranchDisplayName: str
companyIsNewHomeDeveloper: bool
companyLocalNumber: str
rooms: list
latitude: float
longitude: float
nearestStations: list
sizings: list
brochures: list
bedrooms: int
bathrooms: int
tags: list
tenureType: str
tenureMessage: str
propertyType: str
propertySubType: str
sharedOwnership: bool
councilTaxExempt: bool
councilTaxIncluded: bool
annualServiceCharge: str
councilTaxBand: str
copyLinkUrl: str
def convert_price_to_float(primaryPrice: str) -> float:
    # Capture the full digit run including thousands separators,
    # e.g. "£1,250,000" -> "1,250,000".
    extracted_price = re.search(r"([0-9][0-9,]*)", primaryPrice)
    extracted_price_group = (
        extracted_price.group(1) if extracted_price is not None else ""
    )
    return float(extracted_price_group.replace(",", ""))
def parse_property_data(json_data: dict) -> PropertyData:
return {
"id": json_data["propertyData"]["id"],
"published": json_data["propertyData"]["status"]["published"],
"archived": json_data["propertyData"]["status"]["archived"],
"description": json_data["propertyData"]["text"]["description"],
"propertyPhrase": json_data["propertyData"]["text"]["propertyPhrase"],
"auctionFeesDisclaimer": json_data["propertyData"]["text"][
"auctionFeesDisclaimer"
],
"guidePriceDisclaimer": json_data["propertyData"]["text"][
"guidePriceDisclaimer"
],
"reservePriceDisclaimer": json_data["propertyData"]["text"][
"reservePriceDisclaimer"
],
"pageTitle": json_data["propertyData"]["text"]["pageTitle"],
"primaryPrice": json_data["propertyData"]["prices"]["primaryPrice"],
"displayPriceQualifier": json_data["propertyData"]["prices"][
"displayPriceQualifier"
],
"pricePerSqft": json_data["propertyData"]["prices"]["pricePerSqFt"],
"address": json_data["propertyData"]["address"]["displayAddress"],
"outcode": json_data["propertyData"]["address"]["outcode"],
"incode": json_data["propertyData"]["address"]["incode"],
"keyFeatures": json_data["propertyData"]["keyFeatures"],
"images": json_data["propertyData"]["images"],
"floorplans": json_data["propertyData"]["floorplans"],
"companyName": json_data["propertyData"]["customer"]["companyName"],
"companyBranch": json_data["propertyData"]["customer"]["branchName"],
"companyBranchDisplayName": json_data["propertyData"]["customer"][
"branchDisplayName"
],
"companyIsNewHomeDeveloper": json_data["propertyData"]["customer"][
"isNewHomeDeveloper"
],
"companyLocalNumber": json_data["propertyData"]["contactInfo"][
"telephoneNumbers"
]["localNumber"],
"rooms": json_data["propertyData"]["rooms"],
"latitude": json_data["propertyData"]["location"]["latitude"],
"longitude": json_data["propertyData"]["location"]["longitude"],
"nearestStations": json_data["propertyData"]["nearestStations"],
"sizings": json_data["propertyData"]["sizings"],
"brochures": json_data["propertyData"]["brochures"],
"bedrooms": json_data["propertyData"]["bedrooms"],
"bathrooms": json_data["propertyData"]["bathrooms"],
"tags": json_data["propertyData"]["tags"],
"tenureType": json_data["propertyData"]["tenure"]["tenureType"],
"tenureMessage": json_data["propertyData"]["tenure"]["message"],
"propertyType": json_data["propertyData"]["soldPropertyType"],
"propertySubType": json_data["propertyData"]["propertySubType"],
"sharedOwnership": json_data["propertyData"]["sharedOwnership"][
"sharedOwnership"
],
"councilTaxExempt": json_data["propertyData"]["livingCosts"][
"councilTaxExempt"
],
"councilTaxIncluded": json_data["propertyData"]["livingCosts"][
"councilTaxIncluded"
],
"annualServiceCharge": json_data["propertyData"]["livingCosts"][
"annualServiceCharge"
],
"councilTaxBand": json_data["propertyData"]["livingCosts"]["councilTaxBand"],
"copyLinkUrl": json_data["metadata"]["copyLinkUrl"],
}
def convert_base_property_data_to_df(property_data: PropertyData) -> pd.DataFrame:
filtered_property_data = {
"id": property_data["id"],
"published": property_data["published"],
"archived": property_data["archived"],
"description": property_data["description"],
"propertyPhrase": property_data["propertyPhrase"],
"auctionFeesDisclaimer": property_data["auctionFeesDisclaimer"],
"guidePriceDisclaimer": property_data["guidePriceDisclaimer"],
"reservePriceDisclaimer": property_data["reservePriceDisclaimer"],
"pageTitle": property_data["pageTitle"],
"primaryPrice": property_data["primaryPrice"],
"displayPriceQualifier": property_data["displayPriceQualifier"],
"pricePerSqft": property_data["pricePerSqft"],
"address": property_data["address"],
"outcode": property_data["outcode"],
"incode": property_data["incode"],
"keyFeatures": " | ".join(property_data["keyFeatures"]),
"companyName": property_data["companyName"],
"companyBranch": property_data["companyBranch"],
"companyBranchDisplayName": property_data["companyBranchDisplayName"],
"companyIsNewHomeDeveloper": property_data["companyIsNewHomeDeveloper"],
"companyLocalNumber": property_data["companyLocalNumber"],
"latitude": property_data["latitude"],
"longitude": property_data["longitude"],
"bedrooms": property_data["bedrooms"],
"bathrooms": property_data["bathrooms"],
"tags": " | ".join(property_data["tags"]),
"tenureType": property_data["tenureType"],
"tenureMessage": property_data["tenureMessage"],
"propertyType": property_data["propertyType"],
"propertySubType": property_data["propertySubType"],
"sharedOwnership": property_data["sharedOwnership"],
"councilTaxExempt": property_data["councilTaxExempt"],
"councilTaxIncluded": property_data["councilTaxIncluded"],
"annualServiceCharge": property_data["annualServiceCharge"],
"councilTaxBand": property_data["councilTaxBand"],
"copyLinkUrl": property_data["copyLinkUrl"],
}
return pd.DataFrame(filtered_property_data, index=[0])
def convert_nearest_stations_to_df(property_data: PropertyData) -> pd.DataFrame:
nearest_stations = property_data["nearestStations"]
property_id = property_data["id"]
nearest_stations_df = pd.DataFrame(nearest_stations)
nearest_stations_df = nearest_stations_df.assign(id=property_id)
return nearest_stations_df
def convert_list_field_to_df(list_field: list, id: str) -> pd.DataFrame:
list_field_df = pd.DataFrame(list_field)
list_field_df = list_field_df.assign(id=id)
return list_field_df
def get_next_weekday(startdate, weekday):
"""
@startdate: given date, in format '2013-05-25'
    @weekday: week day as an integer, between 0 (Monday) and 6 (Sunday)
"""
d = datetime.strptime(startdate, "%Y-%m-%d")
t = timedelta((7 + weekday - d.weekday()) % 7)
return (d + t).strftime("%Y%m%d")
user_agent_list = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
]
user_agent = random.choice(user_agent_list)
headers = {
"User-Agent": user_agent,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}
web_response = requests.get(args.rightmove_url, headers=headers)
soup = BeautifulSoup(web_response.text, "html.parser")
property_json_string_id = re.compile("PAGE_MODEL")
property_html = soup.find("script", string=property_json_string_id)
re_results = re.search(
    r"<script>\s+window\.PAGE_MODEL = (.+)\s+</script>", str(property_html)
)
re_results_group = re_results.group(1) if re_results is not None else ""
property_json = json.loads(re_results_group)
property_data = parse_property_data(property_json)
converted_primary_price = convert_price_to_float(property_data["primaryPrice"])
price_per_sqm = converted_primary_price / args.property_size_sqm
snapshot_date = datetime.today().strftime("%Y-%m-%d")
property_data_df = convert_base_property_data_to_df(property_data)
property_data_df = property_data_df.assign(numericalPrice=converted_primary_price)
property_data_df = property_data_df.assign(property_size_sqm=args.property_size_sqm)
property_data_df = property_data_df.assign(pricePerSqm=price_per_sqm)
property_data_df = property_data_df.assign(location=args.manual_location)
property_data_df = property_data_df.assign(snapshotDate=snapshot_date)
nearest_stations_df = convert_list_field_to_df(
property_data["nearestStations"], property_data["id"]
)
property_rooms_df = convert_list_field_to_df(
property_data["rooms"], property_data["id"]
)
brochures_df = convert_list_field_to_df(property_data["brochures"], property_data["id"])
sizings_df = convert_list_field_to_df(property_data["sizings"], property_data["id"])
tfl_api_to_location = "YOUR ARRIVAL POSTCODE GOES HERE"
tfl_api_journey_date = get_next_weekday(datetime.today().strftime("%Y-%m-%d"), 3)
tfl_api_journey_time = "0830"
tfl_api_key = "YOUR API KEY GOES HERE"
tfl_api_from_lat_lon = (
str(property_data["latitude"]) + "," + str(property_data["longitude"])
)
tfl_api_from_postcode = property_data["outcode"] + property_data["incode"]
tfl_api_headers = {"Cache-Control": "no-cache"}
tfl_api_url = (
"https://api.tfl.gov.uk/Journey/JourneyResults/"
+ tfl_api_from_lat_lon
+ "/to/"
+ tfl_api_to_location
+ "?nationalSearch=true&date="
+ tfl_api_journey_date
+ "&time="
+ tfl_api_journey_time
+ "&timeIs=Departing&journeyPreference=LeastTime"
+ "&accessibilityPreference=NoRequirements"
+ "&maxTransferMinutes=120&maxWalkingMinutes=120&walkingSpeed=Average"
+ "&cyclePreference=None&bikeProficiency=Easy&api_key="
+ tfl_api_key
)
postcode_api_url = "https://api.postcodes.io/postcodes/" + tfl_api_from_postcode
postcode_location_api_url = (
"https://api.postcodes.io/postcodes?lon="
+ str(property_data["longitude"])
+ "&lat="
+ str(property_data["latitude"])
)
journey_response = requests.get(tfl_api_url, headers=tfl_api_headers)
postcode_api_response = requests.get(postcode_location_api_url)
if journey_response.status_code == 200:
    # Only parse the body on success; error responses may not be valid JSON.
    journey_json = journey_response.json()
    journey_duration = journey_json["journeys"][0]["duration"]
else:
    journey_duration = None
if postcode_api_response.status_code == 200:
    postcode_info_json = postcode_api_response.json()
    postcode_admin_district = postcode_info_json["result"][0]["admin_district"]
    postcode_parish = postcode_info_json["result"][0]["parish"]
    postcode_admin_county = postcode_info_json["result"][0]["admin_county"]
else:
    postcode_admin_district = ""
    postcode_parish = ""
    postcode_admin_county = ""
property_data_df = property_data_df.assign(journey_duration=journey_duration)
property_data_df = property_data_df.assign(postcode_parish=postcode_parish)
property_data_df = property_data_df.assign(
postcode_admin_district=postcode_admin_district
)
property_data_df = property_data_df.assign(postcode_admin_county=postcode_admin_county)
# --- Saving dataframes to CSV
path = "property_csv"
if not os.path.exists(path):
os.mkdir(path)
print("Folder %s created!" % path)
csv_prefix = "./property_csv/" + "[" + snapshot_date + "]" + property_data["id"]
property_data_df.to_csv(csv_prefix + "[base].csv")
nearest_stations_df.to_csv(csv_prefix + "[stations].csv")
property_rooms_df.to_csv(csv_prefix + "[rooms].csv")
brochures_df.to_csv(csv_prefix + "[brochures].csv")
sizings_df.to_csv(csv_prefix + "[sizings].csv")
with open(csv_prefix + "[base].json", "w") as f:
    f.write(json.dumps(property_data))
print("CSV files written for property " + property_data["id"])