Demonstrates my expertise in data cleaning and transformation for generating geospatial features. The process converts raw GPS data into decimal degrees (DD) format and identifies key public transport hubs in Hamburg, Germany.
# imports
import janitor
import re
from pathlib import Path
from subprocess import check_output
import pandas as pd
from shapely.geometry import Point
In this article, we start by importing data on all the stations of the two main types of public transport found in Hamburg, Germany: the U-Bahn (subway) and the S-Bahn (suburban train).
Using this data, the goal is to create geospatial features that will be added to the feature set of every listing found in the core dataset.
The imported data is transformed and features are extracted using regular expressions and the pandas library, to create tidy columns for the new features. All data is made compatible with the main dataset by converting the location data of the subway and suburban train stops found in the imported flat files from degrees, minutes, and seconds (DMS) to decimal degrees (DD).
Once the new features are created, the tabular data is exported as flat files for integration with the main dataset.
The following is a flow chart showing the steps used to create the new features from the imported Wikipedia data.
We create a `Path` variable `data` that holds the path to the directory where the flat files are located.
base = check_output(["pwd"]).decode("utf-8")[:-18]
base = Path(base)
data = Path.joinpath(base, "data")
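Since the fixed slice of 18 characters only works for one specific directory layout, a more robust variant (a sketch, assuming this file lives one level below the project root) could derive the paths like this:
# Sketch: derive the project root from the current working directory
# instead of slicing a fixed number of characters off the pwd output.
base = Path.cwd().parent  # assumes we are one level below the project root
data = base / "data"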
We import two flat files; each holds data that can only be found in it and that is needed to build the final public transport geospatial feature set. `c_ubahn` holds the data for all subway stops and `c_sbahn` for all suburban train stops. `low_memory=False` is added so that pandas reads each file in full before inferring the column dtypes, instead of guessing them from smaller chunks of rows, which avoids mixed-type columns.
# Reading in the data
c_ubahn = pd.read_csv(data / "ubahn_halte_mit_gps.csv", low_memory=False)
c_sbahn = pd.read_csv(data / "sbahn_halte_mit_gps.csv", low_memory=False)
We want to check the column names of both DataFrames to verify that the same data is available for both types of public transport. Looking at the actual headings, we see that they first need to be cleaned and translated. Afterwards, the matching headings are given matching column names in both DataFrames.
c_sbahn.columns, c_ubahn.columns
(Index(['Bahnhof (Kürzel)Karte', 'EröffnungS-Bahn', 'Bezirk/Ort (kursiv)',
'Kat', 'Strecke', 'Linien', 'Umstieg', 'Barriere-freier Zugang',
'Sehenswürdigkeitenöffentliche Einrichtungen', 'Bild'],
dtype='object'),
Index(['Bahnhof (Kürzel)Karte', 'Linie', 'Eröffnung', 'Lage', 'Stadtteil/Ort',
'Umsteige-möglichkeit', 'Barriere-frei', 'Anmerkungen',
'Sehenswürdigkeitenöffent. Einrichtungen', 'Bild'],
dtype='object'))
We clean the headings of the two DataFrames using the `clean_names` function from the `pyjanitor` library.
c_ubahn = c_ubahn.clean_names()
c_sbahn = c_sbahn.clean_names()
c_ubahn.columns, c_sbahn.columns
(Index(['bahnhof_kurzel_karte', 'linie', 'eroffnung', 'lage', 'stadtteil_ort',
'umsteige_moglichkeit', 'barriere_frei', 'anmerkungen',
'sehenswurdigkeitenoffent_einrichtungen', 'bild'],
dtype='object'),
Index(['bahnhof_kurzel_karte', 'eroffnungs_bahn', 'bezirk_ort_kursiv_', 'kat',
'strecke', 'linien', 'umstieg', 'barriere_freier_zugang',
'sehenswurdigkeitenoffentliche_einrichtungen', 'bild'],
dtype='object'))
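For readers who have not used pyjanitor before, the following toy example (built from two of the raw headings above, outside the actual pipeline) shows what `clean_names` does in isolation:
# Toy illustration of pyjanitor's clean_names on two of the raw headings.
toy = pd.DataFrame(columns=["Bahnhof (Kürzel)Karte", "Barriere-freier Zugang"])
print(toy.clean_names().columns.tolist())
# ['bahnhof_kurzel_karte', 'barriere_freier_zugang']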
All columns except ‘bahnhof_kurzel_karte’ are of no interest to us, since they do not contain the GPS information for a given station. This is true for both the subway and the suburban train station CSV data.
Since there are two DataFrames that need the same cleaning steps, we save some time by defining a function that takes a DataFrame as input and returns the cleaned DataFrame. The cleaning steps it performs are: rename ‘bahnhof_kurzel_karte’ to ‘station’, drop every other column, and sort the rows by ‘station’.
def drop_sort(df: pd.DataFrame):
    # Rename the station column using the pyjanitor method registered on DataFrames.
    df = df.rename_column("bahnhof_kurzel_karte", "station")
    # Drop every column except 'station' and sort the stations alphabetically.
    df.drop(
        [col for col in df.columns.tolist() if col != "station"], axis=1, inplace=True
    )
    df.sort_values("station", inplace=True)
    return df
Process our two DataFrames and save each returned DataFrame in a new variable.
dfu = drop_sort(c_ubahn)
dfs = drop_sort(c_sbahn)
Check the columns of the newly created DataFrames to verify that only column ‘station’ is still there.
dfu.columns, dfs.columns
(Index(['station'], dtype='object'), Index(['station'], dtype='object'))
Print the first five rows of each DataFrame to get a better understanding of its structure and of which regular expression is needed to extract the station name and the GPS coordinates from the ‘station’ column. The first five rows already show that two different regular expressions are needed, one for each DataFrame.
for s in dfu, dfs:
    print(f"{s.head(5)}\n\n")
station
0 Ahrensburg Ost (AO)53° 39′ 41″ N, 10° 14′ 36″ O
1 Ahrensburg West (AW)53° 39′ 52″ N, 10° 13′ 12″ O
2 Alsterdorf (AL)53° 36′ 23″ N, 10° 0′ 42″ O
3 Alter Teichweg (AT)53° 35′ 12″ N, 10° 3′ 52″ O
4 Barmbek (BA)53° 35′ 14″ N, 10° 2′ 40″ O
station
0 Agathenburg (AABG)53° 33′ 53″ N, 9° 31′ 48″ O
1 Allermöhe (AALH)53° 29′ 25″ N, 10° 9′ 31″ O
2 Alte Wöhr (Stadtpark) (AAW)53° 35′ 51″ N, 10° ...
3 Altona (AAS)53° 33′ 7″ N, 9° 56′ 4″ O
4 Aumühle (AAMS)53° 31′ 48″ N, 10° 18′ 53″ O
While the compiled regular expression differs for the two DataFrames `dfs` and `dfu`, the steps to extract the name of the station from the ‘station’ column are identical: compile a pattern with a single capture group, then apply it to the column with the pandas `str.extract` function (`expand=False` returns a Series instead of a DataFrame).
# Capture everything before the station abbreviation in parentheses.
bhf = re.compile(r"(^.+)\s\(\w+\)")
dfs["name"] = dfs["station"].str.extract(bhf, expand=False)
dfs.loc[:, ["station", "name"]]
 | station | name
---|---|---
0 | Agathenburg (AABG)53° 33′ 53″ N, 9° 31′ 48″ O | Agathenburg |
1 | Allermöhe (AALH)53° 29′ 25″ N, 10° 9′ 31″ O | Allermöhe |
2 | Alte Wöhr (Stadtpark) (AAW)53° 35′ 51″ N, 10° ... | Alte Wöhr (Stadtpark) |
3 | Altona (AAS)53° 33′ 7″ N, 9° 56′ 4″ O | Altona |
4 | Aumühle (AAMS)53° 31′ 48″ N, 10° 18′ 53″ O | Aumühle |
... | ... | ... |
63 | Wandsbeker Chaussee (AWCH)Lage | Wandsbeker Chaussee |
64 | Wedel (AWL)53° 34′ 55″ N, 9° 42′ 18″ O | Wedel |
65 | Wellingsbüttel (AWBS)53° 38′ 28″ N, 10° 4′ 57″ O | Wellingsbüttel |
66 | Wilhelmsburg (AWFS)53° 29′ 53″ N, 10° 0′ 24″ O | Wilhelmsburg |
67 | Wohltorf (AWLF)53° 31′ 14″ N, 10° 16′ 40″ O | Wohltorf |
68 rows × 2 columns
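To make the pattern more tangible, here is a standalone check on one raw value taken from the table above (an illustration only, not part of the pipeline):
# The single capture group grabs everything before the abbreviation in parentheses.
sample = "Agathenburg (AABG)53° 33′ 53″ N, 9° 31′ 48″ O"
print(bhf.match(sample).group(1))  # Agathenburg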
# Capture the station name: one or two words of letters (including German umlauts),
# stopping before the parenthesised abbreviation and the coordinates.
bhf_ubahn = re.compile(r"(^[-a-zA-Züäöß.ÜÖ,Ä]{0,30}\s?[-a-zA-Züäöß.ÜÖ,Ä]{0,30}[^(\d])")
dfu["name"] = dfu["station"].str.extract(bhf_ubahn, expand=False)
dfu.loc[:, ["station", "name"]]
 | station | name
---|---|---
0 | Ahrensburg Ost (AO)53° 39′ 41″ N, 10° 14′ 36″ O | Ahrensburg Ost |
1 | Ahrensburg West (AW)53° 39′ 52″ N, 10° 13′ 12″ O | Ahrensburg West |
2 | Alsterdorf (AL)53° 36′ 23″ N, 10° 0′ 42″ O | Alsterdorf |
3 | Alter Teichweg (AT)53° 35′ 12″ N, 10° 3′ 52″ O | Alter Teichweg |
4 | Barmbek (BA)53° 35′ 14″ N, 10° 2′ 40″ O | Barmbek |
... | ... | ... |
96 | Wandsbek Markt (WM)53° 34′ 19″ N, 10° 4′ 4″ O | Wandsbek Markt |
95 | Wandsbek-Gartenstadt (WK)53° 35′ 32″ N, 10° 4′... | Wandsbek-Gartenstadt |
97 | Wandsbeker Chaussee (WR)53° 34′ 12″ N, 10° 3′ ... | Wandsbeker Chaussee |
98 | Wartenau (WA)53° 33′ 51″ N, 10° 2′ 4″ O | Wartenau |
92 | Überseequartier (UR)53° 32′ 26″ N, 9° 59′ 56″ O | Überseequartier |
99 rows × 2 columns
Given that the coordinates are in DMS format in the input files, a regular expression is used to extract the entire coordinate pair for each station. Looking at the value range across all rows, the degree component of the latitude is 53 for every station, so the pattern anchors on ‘53’ and captures everything up to the final capital ‘O’ (German for east), which marks the end of one complete coordinate pair.
# five = re.compile("([4-6][1-6].+O)")
five = re.compile("(53.+O)")
dfu["gps"] = dfu["station"].str.extract(five, expand=False)
dfs["gps"] = dfs["station"].str.extract(five, expand=False)
For the subway stations there are no missing values in the GPS column, which means that the capture group extracted something for every row.
dfu[dfu.gps.isna()]
 | station | name | gps
---|---|---|---
Two missing values appear in the table for the suburban trains, but they were not introduced by the regular expression failing to match the coordinates. For these rows there simply are no coordinate values in the ‘station’ column, which is why rows 25 and 63 have missing values in the gps column. This will be fixed in the next step.
dfs[dfs.gps.isna()]
 | station | name | gps
---|---|---|---
25 | Hauptbahnhof (AHS)Lage | Hauptbahnhof | NaN |
63 | Wandsbeker Chaussee (AWCH)Lage | Wandsbeker Chaussee | NaN |
The missing values are for the stations ‘Hauptbahnhof’ and ‘Wandsbeker Chaussee’. We look for other stations at these locations that do have GPS values; their coordinates should be close to those of the missing stations.
Apart from the two rows with missing GPS coordinates, no other entries for these locations are found in the suburban train stations DataFrame. Next, we try to find rows in the subway stations DataFrame that have valid GPS coordinates for the two stations.
dfs[dfs["station"].str.contains(r"Haupt|Wandsbek")]
 | station | name | gps
---|---|---|---
25 | Hauptbahnhof (AHS)Lage | Hauptbahnhof | NaN |
63 | Wandsbeker Chaussee (AWCH)Lage | Wandsbeker Chaussee | NaN |
Valid values for both stations are found in the subway stations DataFrame, at row indexes 34 for ‘Hauptbahnhof’ and 97 for ‘Wandsbeker Chaussee’. These indexes are used to fill the missing values.
dfu[dfu["station"].str.contains("Haupt|Wandsbek")]
 | station | name | gps
---|---|---|---
34 | Hauptbahnhof Nord (HX)53° 33′ 15″ N, 10° 0′ 25″ O | Hauptbahnhof Nord | 53° 33′ 15″ N, 10° 0′ 25″ O |
35 | Hauptbahnhof Süd (HB)53° 33′ 8″ N, 10° 0′ 30″ O | Hauptbahnhof Süd | 53° 33′ 8″ N, 10° 0′ 30″ O |
96 | Wandsbek Markt (WM)53° 34′ 19″ N, 10° 4′ 4″ O | Wandsbek Markt | 53° 34′ 19″ N, 10° 4′ 4″ O |
95 | Wandsbek-Gartenstadt (WK)53° 35′ 32″ N, 10° 4′... | Wandsbek-Gartenstadt | 53° 35′ 32″ N, 10° 4′ 27″ O |
97 | Wandsbeker Chaussee (WR)53° 34′ 12″ N, 10° 3′ ... | Wandsbeker Chaussee | 53° 34′ 12″ N, 10° 3′ 31″ O |
Using the row indexes from the subway stations DataFrame above, the missing values in the suburban train stations DataFrame are filled manually by overwriting their values in the `gps` column.
dfs.loc[25, "gps"] = dfu.loc[34, "gps"]
dfs.loc[63, "gps"] = dfu.loc[97, "gps"]
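A quick check (optional; output omitted) confirms that the two rows now carry coordinates and that no missing values remain:
# Both previously missing rows should now have coordinates.
print(dfs.loc[[25, 63], ["name", "gps"]])
assert dfs["gps"].isna().sum() == 0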
Splitting on ‘,’, we create two new columns from the values in the `gps` column. The new columns will hold the latitude and the longitude values, respectively, for each station.
dfu[["lat", "lng"]] = dfu["gps"].str.split(",", expand=True)
dfs[["lat", "lng"]] = dfs["gps"].str.split(",", expand=True)
The output below verifies, for the first five entries of the `lat` and `lng` columns, that each coordinate pair has the structure (latitude, longitude).
dfu[['lat','lng']].head()
 | lat | lng
---|---|---
0 | 53° 39′ 41″ N | 10° 14′ 36″ O |
1 | 53° 39′ 52″ N | 10° 13′ 12″ O |
2 | 53° 36′ 23″ N | 10° 0′ 42″ O |
3 | 53° 35′ 12″ N | 10° 3′ 52″ O |
4 | 53° 35′ 14″ N | 10° 2′ 40″ O |
All rows of the `lat` and `lng` columns should contain the correct value pairs: `lat` values should always start with the digits 53 and end with a capital N, while values in the `lng` column should start with the digits 10 or the digit 9 and always end with a capital O.
The actual verification is done once the values are converted to DD format, since verifying them then takes much less effort than doing it in the current state, before the conversion.
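That said, a quick spot check of the raw DMS strings is already possible at this point. The following is an optional sketch using pandas string matching, not part of the original pipeline:
# Optional sketch: check the raw DMS strings before the conversion.
# Latitudes should start with 53 and end in N; longitudes with 9 or 10 and end in O.
lat_ok = dfu["lat"].str.strip().str.match(r"^53°.*N$").all()
lng_ok = dfu["lng"].str.strip().str.match(r"^(?:9|10)°.*O$").all()
print(lat_ok, lng_ok)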
Before defining the function that converts the GPS coordinates from DMS to DD format, we look at one row of the latitude and longitude columns in both DataFrames. The structure of the values looks identical in both, which means that one conversion function should suffice; there is no need to define two.
def prev(df: pd.DataFrame):
    print(
        f"latitude component looks like: {df.lat[49]},\n\nlongitude component looks like {df.lng[49]}"
    )
prev(dfu)
latitude component looks like: 53° 39′ 39″ N,
longitude component looks like 10° 1′ 3″ O
prev(dfs)
latitude component looks like: 53° 39′ 7″ N,
longitude component looks like 10° 5′ 38″ O
In order to convert the GPS values from DMS to DD, we define two functions that do the conversion for us and test their output to make sure it is as expected.
def dms2dd(degrees, minutes, seconds, direction):
    # decimal degrees = degrees + minutes/60 + seconds/3600
    dd = float(degrees) + float(minutes) / 60 + float(seconds) / (60 * 60)
    # southern and western coordinates are negative in decimal degrees
    if direction in ("S", "W"):
        dd *= -1
    return dd


def parse_dms(dms):
    # split a DMS string such as "53° 31′ 48″ N" into its four components
    parts = re.split(r"[^\d\w\.]+", dms)
    return dms2dd(parts[0], parts[1], parts[2], parts[3])
dd = parse_dms("78°55'44.33324\" N")
print(dd)
78.92898145555556
print(dfs.lat[4])
print(f"{parse_dms(dfs.lat[4])}")
53° 31′ 48″ N
53.53
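The value 53.53 can be verified by hand with the conversion formula:
# 53° 31′ 48″  →  53 + 31/60 + 48/3600 = 53 + 0.51666... + 0.01333... = 53.53
print(53 + 31 / 60 + 48 / 3600)  # 53.53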
All rows in the `lat` and `lng` columns are stripped of any remaining whitespace. This is done before the `parse_dms` function is applied, since the leading space left over from the comma split would otherwise become an empty first element when the string is split, breaking the conversion.
dfu["lng"] = dfu["lng"].str.strip()
dfu["lat"] = dfu["lat"].str.strip()
dfs["lng"] = dfs["lng"].str.strip()
dfs["lat"] = dfs["lat"].str.strip()
The custom conversion function `parse_dms` is applied to every row of the `lat` and `lng` columns of both DataFrames, and the results are saved in new columns that share the suffix `_dd`, short for decimal degrees. The reason for creating these new columns instead of overwriting the originals is that we want to be able to compare the values before and after the conversion in each row of the DataFrames.
dfu["lat_dd"] = dfu["lat"].apply(lambda x: parse_dms(x))
dfu["lng_dd"] = dfu["lng"].apply(lambda x: parse_dms(x))
dfs["lat_dd"] = dfs["lat"].apply(lambda x: parse_dms(x))
dfs["lng_dd"] = dfs["lng"].apply(lambda x: parse_dms(x))
In order to verify that the converted coordinates are valid for all rows of the `lat_dd` and `lng_dd` columns in both DataFrames, a custom function is used to check them.
The function takes a pandas DataFrame as input and compiles two regular expression patterns. Both patterns match from the start of an entry to its end: they match the valid digits that can occur before the decimal separator, followed by any number of digits after it. Anchoring with `$` makes sure that nothing but digits follows until the end of each value.
def vf(df: pd.DataFrame):
    # latitude must look like 53.<digits>, longitude like 9.<digits> or 10.<digits>
    lat_pat = re.compile(r"^53\.\d+$")
    lng_pat = re.compile(r"^(?:9|10)\.\d+$")
    for _, row in df[["lat_dd", "lng_dd"]].iterrows():
        m1 = lat_pat.match(str(row["lat_dd"]))
        m2 = lng_pat.match(str(row["lng_dd"]))
        # each match must span the complete value, otherwise an AssertionError is raised
        assert m1 is not None and m1.end() == len(str(row["lat_dd"]))
        assert m2 is not None and m2.end() == len(str(row["lng_dd"]))
Both DataFrames pass the verification for all rows. See below:
vf(dfu)
vf(dfs)
Using the `assign` function from the pandas library, a new column `gps_dd` is created. Its values are tuples of GPS coordinates for each subway and suburban train station found in the data. This step is needed in order to apply the `Point` conversion from the `shapely.geometry` module. That conversion makes it possible to integrate the features created here into the listings data in the core dataset: it enables us to compute distances between listings and any station found in the two datasets processed here. Being able to compute distances between features is a powerful tool that can, for example, enhance the interpretation of a tree based model in the later stages of this project.
dfs = dfs.assign(
    gps_dd=list(dfs[["lat_dd", "lng_dd"]].itertuples(index=False, name=None))
)
dfu = dfu.assign(
    gps_dd=list(dfu[["lat_dd", "lng_dd"]].itertuples(index=False, name=None))
)
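As an aside, the same tuples could also be built by zipping the two columns directly; this equivalent alternative is shown only as a sketch and is not the method used here:
# Equivalent alternative to the itertuples approach: zip the two columns.
dfs = dfs.assign(gps_dd=list(zip(dfs["lat_dd"], dfs["lng_dd"])))
dfu = dfu.assign(gps_dd=list(zip(dfu["lat_dd"], dfu["lng_dd"])))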
Applying the `Point` conversion can throw errors that would require us to revisit the previous steps and possibly apply the transformations again. The documentation of the `shapely` library has been my best source for answers to the problems encountered when applying the `Point` conversion.
dfs["gps_dd"] = dfs["gps_dd"].apply(Point)
dfu["gps_dd"] = dfu["gps_dd"].apply(Point)
This is how the tables look in their exported form. The intermediate columns are kept for better interpretability of the data, should any problems arise later down the line. Pandas stores the converted values in `gps_dd` as a column of `shapely.geometry.Point` objects.
def s(df, seed=42):
    return df.sample(n=8, random_state=seed)
s(dfu)
 | station | name | gps | lat | lng | lat_dd | lng_dd | gps_dd
---|---|---|---|---|---|---|---|---
61 | Mümmelmannsberg (MG)53° 31′ 41″ N, 10° 8′ 53″ O | Mümmelmannsberg | 53° 31′ 41″ N, 10° 8′ 53″ O | 53° 31′ 41″ N | 10° 8′ 53″ O | 53.528056 | 10.148056 | POINT (53.528055555555554 10.148055555555555) |
40 | Hudtwalckerstraße (HU)53° 35′ 40″ N, 9° 59′ 46″ O | Hudtwalckerstraße | 53° 35′ 40″ N, 9° 59′ 46″ O | 53° 35′ 40″ N | 9° 59′ 46″ O | 53.594444 | 9.996111 | POINT (53.59444444444445 9.99611111111111) |
95 | Wandsbek-Gartenstadt (WK)53° 35′ 32″ N, 10° 4′... | Wandsbek-Gartenstadt | 53° 35′ 32″ N, 10° 4′ 27″ O | 53° 35′ 32″ N | 10° 4′ 27″ O | 53.592222 | 10.074167 | POINT (53.592222222222226 10.074166666666667) |
18 | Emilienstraße (EM)53° 34′ 18″ N, 9° 57′ 9″ O | Emilienstraße | 53° 34′ 18″ N, 9° 57′ 9″ O | 53° 34′ 18″ N | 9° 57′ 9″ O | 53.571667 | 9.952500 | POINT (53.57166666666667 9.952499999999999) |
98 | Wartenau (WA)53° 33′ 51″ N, 10° 2′ 4″ O | Wartenau | 53° 33′ 51″ N, 10° 2′ 4″ O | 53° 33′ 51″ N | 10° 2′ 4″ O | 53.564167 | 10.034444 | POINT (53.564166666666665 10.034444444444444) |
84 | St. Pauli (PA)53° 33′ 2″ N, 9° 58′ 12″ O | St. Pauli | 53° 33′ 2″ N, 9° 58′ 12″ O | 53° 33′ 2″ N | 9° 58′ 12″ O | 53.550556 | 9.970000 | POINT (53.550555555555555 9.97) |
64 | Niendorf Nord (NN)53° 38′ 26″ N, 9° 57′ 1″ O | Niendorf Nord | 53° 38′ 26″ N, 9° 57′ 1″ O | 53° 38′ 26″ N | 9° 57′ 1″ O | 53.640556 | 9.950278 | POINT (53.64055555555556 9.950277777777776) |
42 | Jungfernstieg (JG)53° 33′ 8″ N, 9° 59′ 37″ O | Jungfernstieg | 53° 33′ 8″ N, 9° 59′ 37″ O | 53° 33′ 8″ N | 9° 59′ 37″ O | 53.552222 | 9.993611 | POINT (53.55222222222222 9.993611111111111) |
s(dfs)
 | station | name | gps | lat | lng | lat_dd | lng_dd | gps_dd
---|---|---|---|---|---|---|---|---
46 | Ohlsdorf (AOPS)53° 37′ 14″ N, 10° 1′ 54″ O | Ohlsdorf | 53° 37′ 14″ N, 10° 1′ 54″ O | 53° 37′ 14″ N | 10° 1′ 54″ O | 53.620556 | 10.031667 | POINT (53.620555555555555 10.031666666666668) |
15 | Elbgaustraße (AEGS)53° 36′ 10″ N, 9° 53′ 35″ O | Elbgaustraße | 53° 36′ 10″ N, 9° 53′ 35″ O | 53° 36′ 10″ N | 9° 53′ 35″ O | 53.602778 | 9.893056 | POINT (53.60277777777778 9.893055555555556) |
4 | Aumühle (AAMS)53° 31′ 48″ N, 10° 18′ 53″ O | Aumühle | 53° 31′ 48″ N, 10° 18′ 53″ O | 53° 31′ 48″ N | 10° 18′ 53″ O | 53.530000 | 10.314722 | POINT (53.53 10.314722222222223) |
9 | Billwerder-Moorfleet (ABWM)53° 30′ 57″ N, 10° ... | Billwerder-Moorfleet | 53° 30′ 57″ N, 10° 5′ 47″ O | 53° 30′ 57″ N | 10° 5′ 47″ O | 53.515833 | 10.096389 | POINT (53.51583333333333 10.096388888888889) |
28 | Hoheneichen (AHCH)53° 38′ 8″ N, 10° 4′ 5″ O | Hoheneichen | 53° 38′ 8″ N, 10° 4′ 5″ O | 53° 38′ 8″ N | 10° 4′ 5″ O | 53.635556 | 10.068056 | POINT (53.635555555555555 10.068055555555555) |
41 | Nettelnburg (ANTB)53° 29′ 16″ N, 10° 10′ 52″ O | Nettelnburg | 53° 29′ 16″ N, 10° 10′ 52″ O | 53° 29′ 16″ N | 10° 10′ 52″ O | 53.487778 | 10.181111 | POINT (53.48777777777778 10.181111111111111) |
58 | Sternschanze (ASST)53° 33′ 49″ N, 9° 57′ 57″ O | Sternschanze | 53° 33′ 49″ N, 9° 57′ 57″ O | 53° 33′ 49″ N | 9° 57′ 57″ O | 53.563611 | 9.965833 | POINT (53.56361111111111 9.965833333333332) |
5 | Bahrenfeld (ABAF)53° 33′ 36″ N, 9° 54′ 38″ O | Bahrenfeld | 53° 33′ 36″ N, 9° 54′ 38″ O | 53° 33′ 36″ N | 9° 54′ 38″ O | 53.560000 | 9.910556 | POINT (53.559999999999995 9.910555555555556) |
The two tables are now ready to be integrated into the main dataset and are therefore exported as CSV files.
dfs.to_csv("../data/s-bahn_final.csv")
dfu.to_csv("../data/u-bahn_final.csv")
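One caveat to keep in mind for the integration step (not shown here): writing to CSV serialises the `gps_dd` column as WKT strings such as ‘POINT (53.55 9.99)’, so when the files are read back in, that column has to be parsed into `Point` objects again, for example:
from shapely import wkt

# Sketch: re-reading the exported file and restoring the Point geometry.
dfs_re = pd.read_csv("../data/s-bahn_final.csv", index_col=0)
dfs_re["gps_dd"] = dfs_re["gps_dd"].apply(wkt.loads)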
These are the steps for creating new geospatial features that are completely independent of the ones in the core dataset scraped from www.immobilienscout24.de. In the following steps, these features will be integrated into the core dataset and used to create new features for each listing.