How to scrape Rightmove listings with Python and augment the data with commuting time from TFL
Example and context for a Python script that scrapes a particular Rightmove listing and uses publicly available APIs to add commute time and postcode information.
The script in this post focuses on extracting the information from a particular Rightmove listing, calculating the time it would take to travel from the listing to your workplace (or any postcode!) using National Rail and TFL services, and adding more information about the listing's location using public postcode APIs.
My goal wasn't to do a big area / modelling analysis that would require scraping all the search results, but to collect the information from the listings I found interesting so I could run comparisons, do specific analyses, or compile the data and share it via Google Sheets.
The code is not clean, since it started with a different focus than CSV exports, but even in its current state it should serve as a starting point if your use case is similar.
A few notes and details on the script
Arguments
The script expects the URL to scrape, the size in square metres, and a location.
- sqm size: Not every listing has the size in the dedicated field, but most of the time it can be found in the floor plan or the description.
- location: The listing data and the API responses used to augment it include location names, but I found that a lot of the time I wanted to tag a listing with the name I use for the area.
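As a sketch, the three arguments map onto positional argparse arguments; the URL, size and location values below are made-up examples, parsed from a list instead of the real command line:

```python
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("rightmove_url", help="Rightmove Property URL")
parser.add_argument("property_size_sqm", help="Size in square metres", type=float)
parser.add_argument("manual_location", help="Your own name for the area")

# Equivalent to: python scraper.py <url> 72.5 Walthamstow
args = parser.parse_args(
    ["https://www.rightmove.co.uk/properties/123456789", "72.5", "Walthamstow"]
)
```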
Parsing
Data
The script takes advantage of a JSON formatted field in Rightmove listings which contains all the information displayed on the page.
The field's components are parsed into typed dictionaries and dataframes. This logic could be simplified now that the script focuses on exporting to CSV.
The components that are annoying to parse because they contain multiple values get their own dataframe and CSV export. They are easy to join to the main data dump when needed, but often I didn't use them for any analysis!
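The core extraction can be sketched in a few lines. The HTML string below is a toy stand-in for a real listing page, with a made-up property id; the full script first locates the script tag with BeautifulSoup before applying a regex like this one:

```python
import json
import re

# Toy stand-in for a Rightmove page (not real listing data).
html = '<script>  window.PAGE_MODEL = {"propertyData": {"id": "123"}}  </script>'

# Pull out the JSON object assigned to window.PAGE_MODEL.
match = re.search(r"window\.PAGE_MODEL = (\{.*\})", html)
page_model = json.loads(match.group(1)) if match else {}
```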
User Agent
The script has a list of valid user agents from which one is drawn at random every time a listing is scraped. This helps avoid getting blocked by Rightmove during intense house-hunting sessions.
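A minimal sketch of the rotation; the two agent strings here are just examples, swap in whichever current browser strings you prefer:

```python
import random

# Example agent strings only; keep these reasonably current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
]

def random_headers() -> dict:
    """Draw a fresh User-Agent for each scrape."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
```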
Commute Time
The Transport for London Unified API is an awesome free resource you can use to programmatically review and calculate all kinds of travel information around London.
In the script, we use the TFL API to get a reference for how long it would take to travel from the listing to any location in London. The API responses include National Rail services, so listings from outside the Big Smoke are fair game.
The call uses a single location as the arrival point (e.g. your workplace), travelling in the middle of the week at 8:30am.
Note that we don't add any complex logic beyond validating the response status. This means you should check for planned disruptions (e.g. strikes or engineering works) and change the parameters accordingly to avoid getting something useless.
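For reference, the journey call boils down to a single GET against the JourneyResults endpoint. This sketch only builds the URL (it sends nothing); the coordinates, arrival postcode, date and API key are all placeholders:

```python
from urllib.parse import urlencode

def build_tfl_journey_url(from_lat_lon: str, to_location: str,
                          date: str, time: str, api_key: str) -> str:
    """Assemble a TFL JourneyResults URL; does not perform the request."""
    base = f"https://api.tfl.gov.uk/Journey/JourneyResults/{from_lat_lon}/to/{to_location}"
    params = {
        "nationalSearch": "true",       # include National Rail services
        "date": date,                   # YYYYMMDD
        "time": time,                   # HHMM, e.g. "0830"
        "timeIs": "Departing",
        "journeyPreference": "LeastTime",
        "api_key": api_key,
    }
    return f"{base}?{urlencode(params)}"

# Placeholder origin, destination, date and key.
url = build_tfl_journey_url("51.5,-0.1", "SW1A1AA", "20240117", "0830", "demo-key")
```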
Postcode Info
Sometimes listings don't have full information about their location; this is common with New Build listings. The Royal Mail data is paid-for, but Postcodes.io is a free alternative that provides a lot of the same info. I'm using it to complement the data from the listing in case I need more identifiers to join with other stats, like crime data.
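The fields the script pulls out of a Postcodes.io reverse-geocode response can be sketched like this; the sample payload below is illustrative, not real API output:

```python
def extract_admin_fields(payload: dict) -> dict:
    """Pick the admin fields the script joins onto the property data."""
    first = payload["result"][0]
    return {
        "admin_district": first.get("admin_district"),
        "parish": first.get("parish"),
        "admin_county": first.get("admin_county"),
    }

# Illustrative response shape; a real call would be
# GET https://api.postcodes.io/postcodes?lon=<lon>&lat=<lat>
sample = {"result": [{"admin_district": "Camden", "parish": None, "admin_county": None}]}
info = extract_admin_fields(sample)
```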
from argparse import ArgumentParser
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
import json
import os
import pandas as pd
import random
import re
import requests
from typing import TypedDict
arg_parser = ArgumentParser()
arg_parser.add_argument("rightmove_url", help="Rightmove Property URL")
arg_parser.add_argument(
"property_size_sqm", help="Property Size in Square Meters", type=float
)
arg_parser.add_argument("manual_location", help="Property Location (Town, City)")
args = arg_parser.parse_args()
class PropertyData(TypedDict):
id: str
published: bool
archived: bool
description: str
propertyPhrase: str
auctionFeesDisclaimer: str
guidePriceDisclaimer: str
reservePriceDisclaimer: str
pageTitle: str
primaryPrice: str
displayPriceQualifier: str
pricePerSqft: str
address: str
outcode: str
incode: str
keyFeatures: list
images: list
floorplans: list
companyName: str
companyBranch: str
companyBranchDisplayName: str
companyIsNewHomeDeveloper: bool
companyLocalNumber: str
rooms: list
latitude: float
longitude: float
nearestStations: list
sizings: list
brochures: list
bedrooms: int
bathrooms: int
tags: list
tenureType: str
tenureMessage: str
propertyType: str
propertySubType: str
sharedOwnership: bool
councilTaxExempt: bool
councilTaxIncluded: bool
annualServiceCharge: str
councilTaxBand: str
copyLinkUrl: str
def convert_price_to_float(primaryPrice: str) -> float:
    # Capture the full digit run including thousands separators,
    # e.g. "£1,250,000" -> "1,250,000".
    extracted_price = re.search(r"([0-9][0-9,]*)", primaryPrice)
    extracted_price_group = (
        extracted_price.group(1) if extracted_price is not None else ""
    )
    return float(extracted_price_group.replace(",", ""))
def parse_property_data(json_data: dict) -> PropertyData:
return {
"id": json_data["propertyData"]["id"],
"published": json_data["propertyData"]["status"]["published"],
"archived": json_data["propertyData"]["status"]["archived"],
"description": json_data["propertyData"]["text"]["description"],
"propertyPhrase": json_data["propertyData"]["text"]["propertyPhrase"],
"auctionFeesDisclaimer": json_data["propertyData"]["text"][
"auctionFeesDisclaimer"
],
"guidePriceDisclaimer": json_data["propertyData"]["text"][
"guidePriceDisclaimer"
],
"reservePriceDisclaimer": json_data["propertyData"]["text"][
"reservePriceDisclaimer"
],
"pageTitle": json_data["propertyData"]["text"]["pageTitle"],
"primaryPrice": json_data["propertyData"]["prices"]["primaryPrice"],
"displayPriceQualifier": json_data["propertyData"]["prices"][
"displayPriceQualifier"
],
"pricePerSqft": json_data["propertyData"]["prices"]["pricePerSqFt"],
"address": json_data["propertyData"]["address"]["displayAddress"],
"outcode": json_data["propertyData"]["address"]["outcode"],
"incode": json_data["propertyData"]["address"]["incode"],
"keyFeatures": json_data["propertyData"]["keyFeatures"],
"images": json_data["propertyData"]["images"],
"floorplans": json_data["propertyData"]["floorplans"],
"companyName": json_data["propertyData"]["customer"]["companyName"],
"companyBranch": json_data["propertyData"]["customer"]["branchName"],
"companyBranchDisplayName": json_data["propertyData"]["customer"][
"branchDisplayName"
],
"companyIsNewHomeDeveloper": json_data["propertyData"]["customer"][
"isNewHomeDeveloper"
],
"companyLocalNumber": json_data["propertyData"]["contactInfo"][
"telephoneNumbers"
]["localNumber"],
"rooms": json_data["propertyData"]["rooms"],
"latitude": json_data["propertyData"]["location"]["latitude"],
"longitude": json_data["propertyData"]["location"]["longitude"],
"nearestStations": json_data["propertyData"]["nearestStations"],
"sizings": json_data["propertyData"]["sizings"],
"brochures": json_data["propertyData"]["brochures"],
"bedrooms": json_data["propertyData"]["bedrooms"],
"bathrooms": json_data["propertyData"]["bathrooms"],
"tags": json_data["propertyData"]["tags"],
"tenureType": json_data["propertyData"]["tenure"]["tenureType"],
"tenureMessage": json_data["propertyData"]["tenure"]["message"],
"propertyType": json_data["propertyData"]["soldPropertyType"],
"propertySubType": json_data["propertyData"]["propertySubType"],
"sharedOwnership": json_data["propertyData"]["sharedOwnership"][
"sharedOwnership"
],
"councilTaxExempt": json_data["propertyData"]["livingCosts"][
"councilTaxExempt"
],
"councilTaxIncluded": json_data["propertyData"]["livingCosts"][
"councilTaxIncluded"
],
"annualServiceCharge": json_data["propertyData"]["livingCosts"][
"annualServiceCharge"
],
"councilTaxBand": json_data["propertyData"]["livingCosts"]["councilTaxBand"],
"copyLinkUrl": json_data["metadata"]["copyLinkUrl"],
}
def convert_base_property_data_to_df(property_data: PropertyData) -> pd.DataFrame:
filtered_property_data = {
"id": property_data["id"],
"published": property_data["published"],
"archived": property_data["archived"],
"description": property_data["description"],
"propertyPhrase": property_data["propertyPhrase"],
"auctionFeesDisclaimer": property_data["auctionFeesDisclaimer"],
"guidePriceDisclaimer": property_data["guidePriceDisclaimer"],
"reservePriceDisclaimer": property_data["reservePriceDisclaimer"],
"pageTitle": property_data["pageTitle"],
"primaryPrice": property_data["primaryPrice"],
"displayPriceQualifier": property_data["displayPriceQualifier"],
"pricePerSqft": property_data["pricePerSqft"],
"address": property_data["address"],
"outcode": property_data["outcode"],
"incode": property_data["incode"],
"keyFeatures": " | ".join(property_data["keyFeatures"]),
"companyName": property_data["companyName"],
"companyBranch": property_data["companyBranch"],
"companyBranchDisplayName": property_data["companyBranchDisplayName"],
"companyIsNewHomeDeveloper": property_data["companyIsNewHomeDeveloper"],
"companyLocalNumber": property_data["companyLocalNumber"],
"latitude": property_data["latitude"],
"longitude": property_data["longitude"],
"bedrooms": property_data["bedrooms"],
"bathrooms": property_data["bathrooms"],
"tags": " | ".join(property_data["tags"]),
"tenureType": property_data["tenureType"],
"tenureMessage": property_data["tenureMessage"],
"propertyType": property_data["propertyType"],
"propertySubType": property_data["propertySubType"],
"sharedOwnership": property_data["sharedOwnership"],
"councilTaxExempt": property_data["councilTaxExempt"],
"councilTaxIncluded": property_data["councilTaxIncluded"],
"annualServiceCharge": property_data["annualServiceCharge"],
"councilTaxBand": property_data["councilTaxBand"],
"copyLinkUrl": property_data["copyLinkUrl"],
}
return pd.DataFrame(filtered_property_data, index=[0])
def convert_nearest_stations_to_df(property_data: PropertyData) -> pd.DataFrame:
nearest_stations = property_data["nearestStations"]
property_id = property_data["id"]
nearest_stations_df = pd.DataFrame(nearest_stations)
nearest_stations_df = nearest_stations_df.assign(id=property_id)
return nearest_stations_df
def convert_list_field_to_df(list_field: list, id: str) -> pd.DataFrame:
list_field_df = pd.DataFrame(list_field)
list_field_df = list_field_df.assign(id=id)
return list_field_df
def get_next_weekday(startdate, weekday):
"""
@startdate: given date, in format '2013-05-25'
    @weekday: week day as an integer, between 0 (Monday) and 6 (Sunday)
"""
d = datetime.strptime(startdate, "%Y-%m-%d")
t = timedelta((7 + weekday - d.weekday()) % 7)
return (d + t).strftime("%Y%m%d")
user_agent_list = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
]
user_agent = random.choice(user_agent_list)
headers = {
"User-Agent": user_agent,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}
web_response = requests.get(args.rightmove_url, headers=headers)
soup = BeautifulSoup(web_response.text, "html.parser")
property_json_string_id = re.compile("PAGE_MODEL")
property_html = soup.find("script", string=property_json_string_id)
re_results = re.search(
    r"<script>\s+window\.PAGE_MODEL = (.+)\s+</script>", str(property_html)
)
re_results_group = re_results.group(1) if re_results is not None else ""
property_json = json.loads(re_results_group)
property_data = parse_property_data(property_json)
converted_primary_price = convert_price_to_float(property_data["primaryPrice"])
price_per_sqm = converted_primary_price / args.property_size_sqm
snapshot_date = datetime.today().strftime("%Y-%m-%d")
property_data_df = convert_base_property_data_to_df(property_data)
property_data_df = property_data_df.assign(numericalPrice=converted_primary_price)
property_data_df = property_data_df.assign(property_size_sqm=args.property_size_sqm)
property_data_df = property_data_df.assign(pricePerSqm=price_per_sqm)
property_data_df = property_data_df.assign(location=args.manual_location)
property_data_df = property_data_df.assign(snapshotDate=snapshot_date)
nearest_stations_df = convert_list_field_to_df(
property_data["nearestStations"], property_data["id"]
)
property_rooms_df = convert_list_field_to_df(
property_data["rooms"], property_data["id"]
)
brochures_df = convert_list_field_to_df(property_data["brochures"], property_data["id"])
sizings_df = convert_list_field_to_df(property_data["sizings"], property_data["id"])
tfl_api_to_location = "YOUR ARRIVAL POSTCODE GOES HERE"
tfl_api_journey_date = get_next_weekday(datetime.today().strftime("%Y-%m-%d"), 3)
tfl_api_journey_time = "0830"
tfl_api_key = "YOUR API KEY GOES HERE"
tfl_api_from_lat_lon = (
str(property_data["latitude"]) + "," + str(property_data["longitude"])
)
tfl_api_from_postcode = property_data["outcode"] + property_data["incode"]
tfl_api_headers = {"Cache-Control": "no-cache"}
tfl_api_url = (
"https://api.tfl.gov.uk/Journey/JourneyResults/"
+ tfl_api_from_lat_lon
+ "/to/"
+ tfl_api_to_location
+ "?nationalSearch=true&date="
+ tfl_api_journey_date
+ "&time="
+ tfl_api_journey_time
+ "&timeIs=Departing&journeyPreference=LeastTime"
+ "&accessibilityPreference=NoRequirements"
+ "&maxTransferMinutes=120&maxWalkingMinutes=120&walkingSpeed=Average"
+ "&cyclePreference=None&bikeProficiency=Easy&api_key="
+ tfl_api_key
)
postcode_api_url = "https://api.postcodes.io/postcodes/" + tfl_api_from_postcode
postcode_location_api_url = (
"https://api.postcodes.io/postcodes?lon="
+ str(property_data["longitude"])
+ "&lat="
+ str(property_data["latitude"])
)
journey_response = requests.get(tfl_api_url, headers=tfl_api_headers)
postcode_api_response = requests.get(postcode_location_api_url)
if journey_response.status_code == 200:
    # Only parse the body on success; error responses may not be valid JSON.
    journey_json = journey_response.json()
    journey_duration = journey_json["journeys"][0]["duration"]
else:
    journey_duration = None
if postcode_api_response.status_code == 200:
    postcode_info_json = postcode_api_response.json()
    postcode_admin_district = postcode_info_json["result"][0]["admin_district"]
    postcode_parish = postcode_info_json["result"][0]["parish"]
    postcode_admin_county = postcode_info_json["result"][0]["admin_county"]
else:
    postcode_admin_district = ""
    postcode_parish = ""
    postcode_admin_county = ""
property_data_df = property_data_df.assign(journey_duration=journey_duration)
property_data_df = property_data_df.assign(postcode_parish=postcode_parish)
property_data_df = property_data_df.assign(
postcode_admin_district=postcode_admin_district
)
property_data_df = property_data_df.assign(postcode_admin_county=postcode_admin_county)
# --- Saving dataframes to CSV
path = "property_csv"
if not os.path.exists(path):
os.mkdir(path)
print("Folder %s created!" % path)
csv_prefix = "./property_csv/" + "[" + snapshot_date + "]" + property_data["id"]
property_data_df.to_csv(csv_prefix + "[base].csv")
nearest_stations_df.to_csv(csv_prefix + "[stations].csv")
property_rooms_df.to_csv(csv_prefix + "[rooms].csv")
brochures_df.to_csv(csv_prefix + "[brochures].csv")
sizings_df.to_csv(csv_prefix + "[sizings].csv")
with open(csv_prefix + "[base].json", "w") as f:
    f.write(json.dumps(property_data))
print("CSV files written for property " + property_data["id"])