Advanced Geospatial Feature Creation

Demonstrates my expertise in data cleaning and transformation for generating geospatial features. The process involves converting raw GPS data into decimal degrees format and identifying key transportation hubs in Hamburg, Germany.

# imports
import janitor
import re
from pathlib import Path
from subprocess import check_output
import pandas as pd
from shapely.geometry import Point

Summary

In this article, we start by importing data on all stations of the two main types of public transport found in Hamburg, Germany. They are the subway (U-Bahn) and the suburban train (S-Bahn).

Using this data, the goal is to create new distance features, listed under The Steps below, that will be added to the list of features for every listing found in the core dataset.

The imported data is transformed and features are extracted using regular expressions and the pandas library, creating tidy columns for the new features. All data is made compatible with the main dataset by converting the location data of the subway and suburban train stops found in the imported flat files from degrees, minutes, and seconds (DMS) to decimal degrees (DD). For example, 53° 33′ 7″ N becomes 53 + 33/60 + 7/3600 ≈ 53.5519° N.

Once the new features are created, the tabular data with the new features is exported as flat files for integration with the main dataset.

The Steps

The following is a flow chart showing the steps used to create the new features from the imported Wikipedia data.

Figure 1 shows the cleaning steps used to create the new features for each listing. They are:

1. Direct distance from each listing to the closest subway station.
2. Direct distance from each listing to the closest suburban train station.

Distance is calculated using a Python implementation of the haversine formula; a minimal sketch follows below.
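The exact haversine implementation used with the core dataset is not shown in this article, so the following is only a minimal sketch of the great-circle distance between two points given in decimal degrees (the example coordinates are illustrative):

from math import asin, cos, radians, sin, sqrt

def haversine(lat1, lng1, lat2, lng2):
    # Great-circle distance in kilometers between two (lat, lng) pairs
    # given in decimal degrees.
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km is the mean Earth radius

# Example: two points near Hamburg's city center, roughly 0.9 km apart.
print(haversine(53.5533, 10.0069, 53.5526, 9.9936))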

Importing The Flat Files

We create a Path variable data that holds the path to the directory where the flat files are located.

# Get the working directory from `pwd` and drop the trailing 18
# characters (subdirectory plus newline) so base points at the project root.
base = check_output(["pwd"]).decode("utf-8")[:-18]
base = Path(base)
data = Path.joinpath(base, "data")
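Slicing a fixed number of characters off the pwd output is brittle. A more portable sketch, assuming this notebook runs from a directory one level below the project root:

from pathlib import Path

# Assumption: the notebook lives in <project root>/<subdirectory>/.
base = Path.cwd().parent
data = base / "data"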

We import two flat files. Each holds data that can only be found in it and that is needed to build the final public transport geospatial feature set: c_ubahn holds the data for all subway stops and c_sbahn for all suburban train stops. low_memory=False is passed so that pandas reads each file in a single pass and infers the column dtypes from all rows instead of from a small subset, avoiding mixed-dtype columns.

# Reading in the data
c_ubahn = pd.read_csv(data / "ubahn_halte_mit_gps.csv", low_memory=False)
c_sbahn = pd.read_csv(data / "sbahn_halte_mit_gps.csv", low_memory=False)

Initial Cleaning

We check the column names of both DataFrames to verify that the same data is available for both types of public transport. Looking at the actual headings, we see that they need to be cleaned and translated first. Afterwards, matching headings are given matching column names in both DataFrames.

c_sbahn.columns, c_ubahn.columns
(Index(['Bahnhof (Kürzel)Karte', 'EröffnungS-Bahn', 'Bezirk/Ort (kursiv)',
        'Kat', 'Strecke', 'Linien', 'Umstieg', 'Barriere-freier Zugang',
        'Sehenswürdigkeitenöffentliche Einrichtungen', 'Bild'],
       dtype='object'),
 Index(['Bahnhof (Kürzel)Karte', 'Linie', 'Eröffnung', 'Lage', 'Stadtteil/Ort',
        'Umsteige-möglichkeit', 'Barriere-frei', 'Anmerkungen',
        'Sehenswürdigkeitenöffent. Einrichtungen', 'Bild'],
       dtype='object'))

We clean the headings of the two DataFrames by using the clean_names function from the pyjanitor library.

c_ubahn = c_ubahn.clean_names()
c_sbahn = c_sbahn.clean_names()
c_ubahn.columns, c_sbahn.columns
(Index(['bahnhof_kurzel_karte', 'linie', 'eroffnung', 'lage', 'stadtteil_ort',
        'umsteige_moglichkeit', 'barriere_frei', 'anmerkungen',
        'sehenswurdigkeitenoffent_einrichtungen', 'bild'],
       dtype='object'),
 Index(['bahnhof_kurzel_karte', 'eroffnungs_bahn', 'bezirk_ort_kursiv_', 'kat',
        'strecke', 'linien', 'umstieg', 'barriere_freier_zugang',
        'sehenswurdigkeitenoffentliche_einrichtungen', 'bild'],
       dtype='object'))

Defining A Custom Function For Cleaning

All columns except ‘bahnhof_kurzel_karte’ are of no interest to us, since they do not contain the GPS information for a given station. This is true for both the underground and the suburban train station CSV data.

Since there are two DataFrames that need the same cleaning steps, we save some time by defining a function that takes a DataFrame as input and returns the cleaned DataFrame. The cleaning steps it performs are: renaming column ‘bahnhof_kurzel_karte’ to ‘station’, dropping all other columns, and sorting the rows by ‘station’.

def drop_sort(df: pd.DataFrame) -> pd.DataFrame:
    # Rename the station column, drop all other columns, and sort by station name.
    df = df.rename_column("bahnhof_kurzel_karte", "station")
    df.drop(
        [col for col in df.columns.tolist() if col != "station"], axis=1, inplace=True
    )
    df.sort_values("station", inplace=True)
    return df

Process our two DataFrames and save each returned DataFrame in a new variable.

dfu = drop_sort(c_ubahn)
dfs = drop_sort(c_sbahn)

Check the columns of the newly created DataFrames to verify that only the ‘station’ column remains.

dfu.columns, dfs.columns
(Index(['station'], dtype='object'), Index(['station'], dtype='object'))

Print the first five rows of each DataFrame to get a better understanding of its structure and of what regular expression is needed to extract the station name and the GPS coordinates from the ‘station’ column. The first five rows already show that two different regular expressions are needed, one for each DataFrame.

for s in dfu, dfs:
    print(f"{s.head(5)}\n\n")
                                            station
0   Ahrensburg Ost (AO)53° 39′ 41″ N, 10° 14′ 36″ O
1  Ahrensburg West (AW)53° 39′ 52″ N, 10° 13′ 12″ O
2        Alsterdorf (AL)53° 36′ 23″ N, 10° 0′ 42″ O
3    Alter Teichweg (AT)53° 35′ 12″ N, 10° 3′ 52″ O
4           Barmbek (BA)53° 35′ 14″ N, 10° 2′ 40″ O


                                             station
0      Agathenburg (AABG)53° 33′ 53″ N, 9° 31′ 48″ O
1        Allermöhe (AALH)53° 29′ 25″ N, 10° 9′ 31″ O
2  Alte Wöhr (Stadtpark) (AAW)53° 35′ 51″ N, 10° ...
3              Altona (AAS)53° 33′ 7″ N, 9° 56′ 4″ O
4         Aumühle (AAMS)53° 31′ 48″ N, 10° 18′ 53″ O

The Heavy Lifting - Extracting The GPS Coordinates

While the compiled regular expression differs between the two DataFrames dfs and dfu, the steps to extract the name of the station from the ‘station’ column are identical: compile a pattern with a single capture group, apply it with str.extract, and store the result in a new ‘name’ column.

bhf = re.compile(r"(^.+)\s\(\w+\)")
dfs["name"] = dfs["station"].str.extract(bhf, expand=False)
dfs.loc[:, ["station", "name"]]
station name
0 Agathenburg (AABG)53° 33′ 53″ N, 9° 31′ 48″ O Agathenburg
1 Allermöhe (AALH)53° 29′ 25″ N, 10° 9′ 31″ O Allermöhe
2 Alte Wöhr (Stadtpark) (AAW)53° 35′ 51″ N, 10° ... Alte Wöhr (Stadtpark)
3 Altona (AAS)53° 33′ 7″ N, 9° 56′ 4″ O Altona
4 Aumühle (AAMS)53° 31′ 48″ N, 10° 18′ 53″ O Aumühle
... ... ...
63 Wandsbeker Chaussee (AWCH)Lage Wandsbeker Chaussee
64 Wedel (AWL)53° 34′ 55″ N, 9° 42′ 18″ O Wedel
65 Wellingsbüttel (AWBS)53° 38′ 28″ N, 10° 4′ 57″ O Wellingsbüttel
66 Wilhelmsburg (AWFS)53° 29′ 53″ N, 10° 0′ 24″ O Wilhelmsburg
67 Wohltorf (AWLF)53° 31′ 14″ N, 10° 16′ 40″ O Wohltorf

68 rows × 2 columns

bhf_ubahn = re.compile(r"(^[-a-zA-Züäöß.ÜÖ,Ä]{0,30}\s?[-a-zA-Züäöß.ÜÖ,Ä]{0,30}[^(\d])")
dfu["name"] = dfu["station"].str.extract(bhf_ubahn, expand=False)
dfu.loc[:, ["station", "name"]]
station name
0 Ahrensburg Ost (AO)53° 39′ 41″ N, 10° 14′ 36″ O Ahrensburg Ost
1 Ahrensburg West (AW)53° 39′ 52″ N, 10° 13′ 12″ O Ahrensburg West
2 Alsterdorf (AL)53° 36′ 23″ N, 10° 0′ 42″ O Alsterdorf
3 Alter Teichweg (AT)53° 35′ 12″ N, 10° 3′ 52″ O Alter Teichweg
4 Barmbek (BA)53° 35′ 14″ N, 10° 2′ 40″ O Barmbek
... ... ...
96 Wandsbek Markt (WM)53° 34′ 19″ N, 10° 4′ 4″ O Wandsbek Markt
95 Wandsbek-Gartenstadt (WK)53° 35′ 32″ N, 10° 4′... Wandsbek-Gartenstadt
97 Wandsbeker Chaussee (WR)53° 34′ 12″ N, 10° 3′ ... Wandsbeker Chaussee
98 Wartenau (WA)53° 33′ 51″ N, 10° 2′ 4″ O Wartenau
92 Überseequartier (UR)53° 32′ 26″ N, 9° 59′ 56″ O Überseequartier

99 rows × 2 columns

Extract Entire Coordinate Pairs

Given that the coordinates in the input files are in DMS format, regular expressions are used to extract the entire coordinate pair for each station. Looking at the value range across all rows, the degree component of the latitude is always 53, so the pattern anchors on it. It matches and captures everything up to the final capital O (of which there is exactly one per row), which marks the end of a complete coordinate pair.

# five = re.compile("([4-6][1-6].+O)")
five = re.compile("(53.+O)")
dfu["gps"] = dfu["station"].str.extract(five, expand=False)
dfs["gps"] = dfs["station"].str.extract(five, expand=False)

Check For Missing Values

For the subway stations, there are no missing values in the GPS column, which means that the capture group extracted something for every row.

dfu[dfu.gps.isna()]
station name gps

For the suburban trains, two missing values are found, but they were not introduced by the regular expression failing to match the coordinates: the ‘station’ column itself contains no coordinate values for these rows, which is why rows 25 and 63 have missing values in the GPS column. This will be fixed in the next step.

dfs[dfs.gps.isna()]
station name gps
25 Hauptbahnhof (AHS)Lage Hauptbahnhof NaN
63 Wandsbeker Chaussee (AWCH)Lage Wandsbeker Chaussee NaN

Fill Missing Values

The missing values are for the stations ‘Hauptbahnhof’ and ‘Wandsbeker Chaussee’. We look for other stations at these locations that do have GPS values; their coordinates should be close to those of the missing stations.

No other entries, apart from the two rows with missing GPS coordinates, are found for these stations in the suburban train stations DataFrame. Next, we try to find rows in the subway stations DataFrame that have valid GPS coordinates for the two stations.

dfs[dfs["station"].str.contains(r"Haupt|Wandsbek")]
station name gps
25 Hauptbahnhof (AHS)Lage Hauptbahnhof NaN
63 Wandsbeker Chaussee (AWCH)Lage Wandsbeker Chaussee NaN

Valid values for both stations are found in the subway stations DataFrame. The row indexes are 34 for ‘Hauptbahnhof’ and 97 for ‘Wandsbeker Chaussee’; these are used to fill the missing values.

dfu[dfu["station"].str.contains("Haupt|Wandsbek")]
station name gps
34 Hauptbahnhof Nord (HX)53° 33′ 15″ N, 10° 0′ 25″ O Hauptbahnhof Nord 53° 33′ 15″ N, 10° 0′ 25″ O
35 Hauptbahnhof Süd (HB)53° 33′ 8″ N, 10° 0′ 30″ O Hauptbahnhof Süd 53° 33′ 8″ N, 10° 0′ 30″ O
96 Wandsbek Markt (WM)53° 34′ 19″ N, 10° 4′ 4″ O Wandsbek Markt 53° 34′ 19″ N, 10° 4′ 4″ O
95 Wandsbek-Gartenstadt (WK)53° 35′ 32″ N, 10° 4′... Wandsbek-Gartenstadt 53° 35′ 32″ N, 10° 4′ 27″ O
97 Wandsbeker Chaussee (WR)53° 34′ 12″ N, 10° 3′ ... Wandsbeker Chaussee 53° 34′ 12″ N, 10° 3′ 31″ O

Using the row indexes found in the subway stations DataFrame above, the missing values in the suburban train stations DataFrame are filled manually by overwriting their values in the gps column.

dfs.loc[25, "gps"] = dfu.loc[34, "gps"]
dfs.loc[63, "gps"] = dfu.loc[97, "gps"]

Splitting on ‘,’, we create two new columns from the values in the gps column. The new columns hold the latitude and longitude values, respectively, for each station.

dfu[["lat", "lng"]] = dfu["gps"].str.split(",", expand=True)
dfs[["lat", "lng"]] = dfs["gps"].str.split(",", expand=True)

The output below verifies, for the first five entries of the lat and lng columns, that each coordinate pair follows the (latitude, longitude) order.

dfu[['lat','lng']].head()
lat lng
0 53° 39′ 41″ N 10° 14′ 36″ O
1 53° 39′ 52″ N 10° 13′ 12″ O
2 53° 36′ 23″ N 10° 0′ 42″ O
3 53° 35′ 12″ N 10° 3′ 52″ O
4 53° 35′ 14″ N 10° 2′ 40″ O

All rows of the lat and lng columns should hold the correct halves of each pair: lat values should always start with the digits 53 and end with a capital N, while values in the lng column should start with 10 or 9 and always end with a capital O.

The actual verification is done once the values are converted to DD format, since verifying them then takes much less effort than in their current DMS form.
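Still, for illustration, a quick pre-conversion check could look like the following sketch. It is not part of the original pipeline and assumes every entry carries degree, minute, and second components, as in the samples above:

# Illustrative only: every lat value must read 53° m′ s″ N and every
# lng value 9° or 10° m′ s″ O (after stripping stray whitespace).
for df in (dfu, dfs):
    assert df["lat"].str.strip().str.match(r"^53° \d+′ \d+″ N$").all()
    assert df["lng"].str.strip().str.match(r"^(?:10|9)° \d+′ \d+″ O$").all()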

Before defining the function that converts the GPS coordinates from DMS to DD format, we look at one row of the latitude and longitude columns of both DataFrames. The structure of the values is identical in both, which means one conversion function should suffice; there is no need to define two.

def prev(df: pd.DataFrame):
    print(
        f"latitude component looks like: {df.lat[49]},\n\nlongitude component looks like {df.lng[49]}"
    )
prev(dfu)
latitude component looks like: 53° 39′ 39″ N,

longitude component looks like  10° 1′ 3″ O
prev(dfs)
latitude component looks like: 53° 39′ 7″ N,

longitude component looks like  10° 5′ 38″ O

To convert the GPS values from DMS to DD, we define two functions that perform the conversion. We test the output to make sure it is as expected.

def dms2dd(degrees, minutes, seconds, direction):
    # DD = degrees + minutes / 60 + seconds / 3600.
    dd = float(degrees) + float(minutes) / 60 + float(seconds) / (60 * 60)
    # Southern latitudes and western longitudes are negative in DD.
    if direction in ("S", "W"):
        dd *= -1
    return dd

def parse_dms(dms):
    # Split on every run of characters that is neither alphanumeric nor a
    # dot, yielding (degrees, minutes, seconds, direction).
    parts = re.split(r"[^\d\w\.]+", dms)
    return dms2dd(parts[0], parts[1], parts[2], parts[3])

dd = parse_dms("78°55'44.33324\" N")
print(dd)
78.92898145555556
print(dfs.lat[4])
print(f"{parse_dms(dfs.lat[4])}")
53° 31′ 48″ N
53.53

All rows in the lat and lng columns are stripped of any remaining whitespace. This is done before applying the parse_dms function, since the regex pattern might fail to match all relevant parts of the coordinates if there is unaccounted-for whitespace.

dfu["lng"] = dfu["lng"].str.strip()
dfu["lat"] = dfu["lat"].str.strip()
dfs["lng"] = dfs["lng"].str.strip()
dfs["lat"] = dfs["lat"].str.strip()

The custom conversion function parse_dms is applied to all rows of the lat and lng columns of both DataFrames, and the results are saved in new columns sharing the suffix _dd, short for decimal degrees. These new columns make it possible to compare the values before and after the conversion in each row of the DataFrames.

dfu["lat_dd"] = dfu["lat"].apply(lambda x: parse_dms(x))
dfu["lng_dd"] = dfu["lng"].apply(lambda x: parse_dms(x))
dfs["lat_dd"] = dfs["lat"].apply(lambda x: parse_dms(x))
dfs["lng_dd"] = dfs["lng"].apply(lambda x: parse_dms(x))

Final Verification Of The GPS Coordinates

In order to verify that the coordinates are valid for all rows of the lat_dd and lng_dd columns in both DataFrames, a custom function is used to check both DataFrames.

The function takes a pandas DataFrame as input and compiles two regular expression patterns at runtime. Both patterns are anchored to the start and end of each entry: they match the valid digits that can occur before the decimal separator and then any number of digits after it. The $ anchor makes sure that nothing but digits follows the separator until the end of each value.

def vf(df: pd.DataFrame):
    # Anchored patterns: the whole value must be digits around a single dot.
    lat_pat = re.compile(r"^53\.\d+$")
    lng_pat = re.compile(r"^1?[09]\.\d+$")
    for _, row in df[["lat_dd", "lng_dd"]].iterrows():
        m1, m2 = lat_pat.match(str(row["lat_dd"])), lng_pat.match(str(row["lng_dd"]))
        assert m1.end() == len(str(row["lat_dd"]))
        assert m2.end() == len(str(row["lng_dd"]))

Both DataFrames pass the verification for all rows. See below:

vf(dfu)
vf(dfs)

Creating The Final GPS Column

Using the assign function from the pandas library, a new column gps_dd is created. Its values are tuples of GPS coordinates for each subway and suburban train station found in the data. This step is needed in order to apply the Point conversion from the shapely.geometry module. The conversion makes it possible to integrate the features created here into the listings data in the core dataset and to compute distances between listings and any station found in the two datasets processed here. The ability to compute distances between features is a powerful tool that can, for example, enhance the results we get from interpreting a tree-based model in the later stages of this project.

dfs = dfs.assign(
    gps_dd=list(dfs[["lat_dd", "lng_dd"]].itertuples(index=False, name=None))
)
dfu = dfu.assign(
    gps_dd=list(dfu[["lat_dd", "lng_dd"]].itertuples(index=False, name=None))
)

Applying the Point conversion can throw errors that would require revisiting all the previous steps and possibly reapplying the transformations. The documentation of the shapely library has been my best source for answers to problems encountered when applying the Point conversion.

dfs["gps_dd"] = dfs["gps_dd"].apply(Point)
dfu["gps_dd"] = dfu["gps_dd"].apply(Point)

Final Glimpse At The Tables

This is how the tables look in their exported form. The intermediate columns are kept for better interpretability of the data, should any problems arise later down the line. Pandas stores the converted values as a column of shapely.geometry.Point objects.

def s(df, seed=42):
    return df.sample(n=8, random_state=seed)

s(dfu)
station name gps lat lng lat_dd lng_dd gps_dd
61 Mümmelmannsberg (MG)53° 31′ 41″ N, 10° 8′ 53″ O Mümmelmannsberg 53° 31′ 41″ N, 10° 8′ 53″ O 53° 31′ 41″ N 10° 8′ 53″ O 53.528056 10.148056 POINT (53.528055555555554 10.148055555555555)
40 Hudtwalckerstraße (HU)53° 35′ 40″ N, 9° 59′ 46″ O Hudtwalckerstraße 53° 35′ 40″ N, 9° 59′ 46″ O 53° 35′ 40″ N 9° 59′ 46″ O 53.594444 9.996111 POINT (53.59444444444445 9.99611111111111)
95 Wandsbek-Gartenstadt (WK)53° 35′ 32″ N, 10° 4′... Wandsbek-Gartenstadt 53° 35′ 32″ N, 10° 4′ 27″ O 53° 35′ 32″ N 10° 4′ 27″ O 53.592222 10.074167 POINT (53.592222222222226 10.074166666666667)
18 Emilienstraße (EM)53° 34′ 18″ N, 9° 57′ 9″ O Emilienstraße 53° 34′ 18″ N, 9° 57′ 9″ O 53° 34′ 18″ N 9° 57′ 9″ O 53.571667 9.952500 POINT (53.57166666666667 9.952499999999999)
98 Wartenau (WA)53° 33′ 51″ N, 10° 2′ 4″ O Wartenau 53° 33′ 51″ N, 10° 2′ 4″ O 53° 33′ 51″ N 10° 2′ 4″ O 53.564167 10.034444 POINT (53.564166666666665 10.034444444444444)
84 St. Pauli (PA)53° 33′ 2″ N, 9° 58′ 12″ O St. Pauli 53° 33′ 2″ N, 9° 58′ 12″ O 53° 33′ 2″ N 9° 58′ 12″ O 53.550556 9.970000 POINT (53.550555555555555 9.97)
64 Niendorf Nord (NN)53° 38′ 26″ N, 9° 57′ 1″ O Niendorf Nord 53° 38′ 26″ N, 9° 57′ 1″ O 53° 38′ 26″ N 9° 57′ 1″ O 53.640556 9.950278 POINT (53.64055555555556 9.950277777777776)
42 Jungfernstieg (JG)53° 33′ 8″ N, 9° 59′ 37″ O Jungfernstieg 53° 33′ 8″ N, 9° 59′ 37″ O 53° 33′ 8″ N 9° 59′ 37″ O 53.552222 9.993611 POINT (53.55222222222222 9.993611111111111)
s(dfs)
station name gps lat lng lat_dd lng_dd gps_dd
46 Ohlsdorf (AOPS)53° 37′ 14″ N, 10° 1′ 54″ O Ohlsdorf 53° 37′ 14″ N, 10° 1′ 54″ O 53° 37′ 14″ N 10° 1′ 54″ O 53.620556 10.031667 POINT (53.620555555555555 10.031666666666668)
15 Elbgaustraße (AEGS)53° 36′ 10″ N, 9° 53′ 35″ O Elbgaustraße 53° 36′ 10″ N, 9° 53′ 35″ O 53° 36′ 10″ N 9° 53′ 35″ O 53.602778 9.893056 POINT (53.60277777777778 9.893055555555556)
4 Aumühle (AAMS)53° 31′ 48″ N, 10° 18′ 53″ O Aumühle 53° 31′ 48″ N, 10° 18′ 53″ O 53° 31′ 48″ N 10° 18′ 53″ O 53.530000 10.314722 POINT (53.53 10.314722222222223)
9 Billwerder-Moorfleet (ABWM)53° 30′ 57″ N, 10° ... Billwerder-Moorfleet 53° 30′ 57″ N, 10° 5′ 47″ O 53° 30′ 57″ N 10° 5′ 47″ O 53.515833 10.096389 POINT (53.51583333333333 10.096388888888889)
28 Hoheneichen (AHCH)53° 38′ 8″ N, 10° 4′ 5″ O Hoheneichen 53° 38′ 8″ N, 10° 4′ 5″ O 53° 38′ 8″ N 10° 4′ 5″ O 53.635556 10.068056 POINT (53.635555555555555 10.068055555555555)
41 Nettelnburg (ANTB)53° 29′ 16″ N, 10° 10′ 52″ O Nettelnburg 53° 29′ 16″ N, 10° 10′ 52″ O 53° 29′ 16″ N 10° 10′ 52″ O 53.487778 10.181111 POINT (53.48777777777778 10.181111111111111)
58 Sternschanze (ASST)53° 33′ 49″ N, 9° 57′ 57″ O Sternschanze 53° 33′ 49″ N, 9° 57′ 57″ O 53° 33′ 49″ N 9° 57′ 57″ O 53.563611 9.965833 POINT (53.56361111111111 9.965833333333332)
5 Bahrenfeld (ABAF)53° 33′ 36″ N, 9° 54′ 38″ O Bahrenfeld 53° 33′ 36″ N, 9° 54′ 38″ O 53° 33′ 36″ N 9° 54′ 38″ O 53.560000 9.910556 POINT (53.559999999999995 9.910555555555556)

Export The Cleaned Station Data

The two tables are ready to be integrated into the main dataset and are therefore exported as csv files.

dfs.to_csv("../data/s-bahn_final.csv")
dfu.to_csv("../data/u-bahn_final.csv")

These are the steps for creating new geospatial features that are completely independent of the ones in the core dataset scraped from www.immobilienscout24.de. In the following steps, these features will be integrated into the core dataset and used to create new features for each listing found in the core dataset.
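As a preview of that integration, the following is a minimal sketch of a nearest-station feature. It assumes a hypothetical listing coordinate and reuses the haversine sketch from above:

# Hypothetical listing coordinate (decimal degrees), for illustration only.
listing_lat, listing_lng = 53.5511, 9.9937

# Distance from the listing to every subway station; .x/.y follow this
# column's (lat, lng) ordering.
dists = dfu["gps_dd"].apply(lambda p: haversine(listing_lat, listing_lng, p.x, p.y))
print(dfu.loc[dists.idxmin(), "name"], round(dists.min(), 2), "km")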