Advanced Geospatial Feature Creation

Demonstrates my expertise in data cleaning and transformation for generating geospatial features. The process involves converting raw GPS data into decimal degrees format and identifying key transportation hubs in Hamburg, Germany.

# imports
import janitor
import re
from pathlib import Path
from subprocess import check_output
import pandas as pd
from shapely.geometry import Point

Summary

In this article, we start by importing data on all stations of the two main types of public transport found in Hamburg, Germany. They are the subway (U-Bahn) and the suburban train (S-Bahn).

Using this data, the goal is to create new distance features, listed under The Steps below, that will be added to the list of features for every listing found in the core dataset.

The imported data is transformed and features are extracted using regular expressions and the pandas library, creating tidy columns for the new features. All data is made compatible with the main dataset by converting the location data of the subway and suburban train stops found in the imported flat files from degrees, minutes, and seconds (DMS) to decimal degrees (DD). For example, 53° 33′ 7″ N becomes 53 + 33/60 + 7/3600 ≈ 53.5519° N.

Once the new features are created, the tabular data with the new features is exported as flat files for integration with the main dataset.

The Steps

The following is a flow chart showing the steps used to create the new features from the imported Wikipedia data.

Figure 1 shows the cleaning steps used to create the new features for each listing. They are:

1. Direct distance from each listing to the closest subway station.
2. Direct distance from each listing to the closest suburban train station.

Distance is calculated using a Python implementation of the haversine formula; a minimal sketch follows below.
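The exact haversine implementation used with the core dataset is not shown in this article, so the following is only a minimal sketch of the great-circle distance between two points given in decimal degrees (the example coordinates are illustrative):

from math import asin, cos, radians, sin, sqrt

def haversine(lat1, lng1, lat2, lng2):
    # Great-circle distance in kilometers between two (lat, lng) pairs
    # given in decimal degrees.
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km is the mean Earth radius

# Example: two points near Hamburg's city center, roughly 0.9 km apart.
print(haversine(53.5533, 10.0069, 53.5526, 9.9936))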

Importing The Flat Files

We create a Path variable data that holds the path to the directory where the flat files are located.

# Get the working directory from `pwd` and drop the trailing 18
# characters (subdirectory plus newline) so base points at the project root.
base = check_output(["pwd"]).decode("utf-8")[:-18]
base = Path(base)
data = Path.joinpath(base, "data")
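Slicing a fixed number of characters off the pwd output is brittle. A more portable sketch, assuming this notebook runs from a directory one level below the project root:

from pathlib import Path

# Assumption: the notebook lives in <project root>/<subdirectory>/.
base = Path.cwd().parent
data = base / "data"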

We import two flat files. Each holds data that can only be found in it and that is needed to build the final public transport geospatial feature set: c_ubahn holds the data for all subway stops and c_sbahn for all suburban train stops. low_memory=False is passed so that pandas reads each file in a single pass and infers the column dtypes from all rows instead of from a small subset, avoiding mixed-dtype columns.

# Reading in the data
c_ubahn = pd.read_csv(data / "ubahn_halte_mit_gps.csv", low_memory=False)
c_sbahn = pd.read_csv(data / "sbahn_halte_mit_gps.csv", low_memory=False)

Initial Cleaning

We check the column names of both DataFrames to verify that the same data is available for both types of public transport. Looking at the actual headings, we see that they need to be cleaned and translated first. Afterwards, matching headings are given matching column names in both DataFrames.

c_sbahn.columns, c_ubahn.columns
(Index(['Bahnhof (Kürzel)Karte', 'EröffnungS-Bahn', 'Bezirk/Ort (kursiv)',
        'Kat', 'Strecke', 'Linien', 'Umstieg', 'Barriere-freier Zugang',
        'Sehenswürdigkeitenöffentliche Einrichtungen', 'Bild'],
       dtype='object'),
 Index(['Bahnhof (Kürzel)Karte', 'Linie', 'Eröffnung', 'Lage', 'Stadtteil/Ort',
        'Umsteige-möglichkeit', 'Barriere-frei', 'Anmerkungen',
        'Sehenswürdigkeitenöffent. Einrichtungen', 'Bild'],
       dtype='object'))

We clean the headings of the two DataFrames by using the clean_names function from the pyjanitor library.

c_ubahn = c_ubahn.clean_names()
c_sbahn = c_sbahn.clean_names()
c_ubahn.columns, c_sbahn.columns
(Index(['bahnhof_kurzel_karte', 'linie', 'eroffnung', 'lage', 'stadtteil_ort',
        'umsteige_moglichkeit', 'barriere_frei', 'anmerkungen',
        'sehenswurdigkeitenoffent_einrichtungen', 'bild'],
       dtype='object'),
 Index(['bahnhof_kurzel_karte', 'eroffnungs_bahn', 'bezirk_ort_kursiv_', 'kat',
        'strecke', 'linien', 'umstieg', 'barriere_freier_zugang',
        'sehenswurdigkeitenoffentliche_einrichtungen', 'bild'],
       dtype='object'))

Defining A Custom Function For Cleaning

All columns except ‘bahnhof_kurzel_karte’ are of no interest to us, since they do not contain the GPS information for a given station. This is true for both the underground and the suburban train station CSV data.

Since there are two DataFrames that need the same cleaning steps, we save some time by defining a function that takes a DataFrame as input and returns the cleaned DataFrame. The cleaning steps it performs are: renaming column ‘bahnhof_kurzel_karte’ to ‘station’, dropping all other columns, and sorting the rows by ‘station’.

def drop_sort(df: pd.DataFrame) -> pd.DataFrame:
    # Rename the station column, drop all other columns, and sort by station name.
    df = df.rename_column("bahnhof_kurzel_karte", "station")
    df.drop(
        [col for col in df.columns.tolist() if col != "station"], axis=1, inplace=True
    )
    df.sort_values("station", inplace=True)
    return df

Process our two DataFrames and save each returned DataFrame in a new variable.

dfu = drop_sort(c_ubahn)
dfs = drop_sort(c_sbahn)

Check the columns of the newly created DataFrames to verify that only the ‘station’ column remains.

dfu.columns, dfs.columns
(Index(['station'], dtype='object'), Index(['station'], dtype='object'))

Print the first five rows of each DataFrame to get a better understanding of its structure and of what regular expression is needed to extract the station name and the GPS coordinates from the ‘station’ column. The first five rows already show that two different regular expressions are needed, one for each DataFrame.

for s in dfu, dfs:
    print(f"{s.head(5)}\n\n")
                                            station
0   Ahrensburg Ost (AO)53° 39′ 41″ N, 10° 14′ 36″ O
1  Ahrensburg West (AW)53° 39′ 52″ N, 10° 13′ 12″ O
2        Alsterdorf (AL)53° 36′ 23″ N, 10° 0′ 42″ O
3    Alter Teichweg (AT)53° 35′ 12″ N, 10° 3′ 52″ O
4           Barmbek (BA)53° 35′ 14″ N, 10° 2′ 40″ O


                                             station
0      Agathenburg (AABG)53° 33′ 53″ N, 9° 31′ 48″ O
1        Allermöhe (AALH)53° 29′ 25″ N, 10° 9′ 31″ O
2  Alte Wöhr (Stadtpark) (AAW)53° 35′ 51″ N, 10° ...
3              Altona (AAS)53° 33′ 7″ N, 9° 56′ 4″ O
4         Aumühle (AAMS)53° 31′ 48″ N, 10° 18′ 53″ O

The Heavy Lifting - Extracting The GPS Coordinates

While the compiled regular expression differs between the two DataFrames dfs and dfu, the steps to extract the name of the station from the ‘station’ column are identical: compile a pattern with a single capture group, apply it with str.extract, and store the result in a new ‘name’ column.

bhf = re.compile(r"(^.+)\s\(\w+\)")
dfs["name"] = dfs["station"].str.extract(bhf, expand=False)
dfs.loc[:, ["station", "name"]]
station name
0 Agathenburg (AABG)53° 33′ 53″ N, 9° 31′ 48″ O Agathenburg
1 Allermöhe (AALH)53° 29′ 25″ N, 10° 9′ 31″ O Allermöhe
2 Alte Wöhr (Stadtpark) (AAW)53° 35′ 51″ N, 10° ... Alte Wöhr (Stadtpark)
3 Altona (AAS)53° 33′ 7″ N, 9° 56′ 4″ O Altona
4 Aumühle (AAMS)53° 31′ 48″ N, 10° 18′ 53″ O Aumühle
... ... ...
63 Wandsbeker Chaussee (AWCH)Lage Wandsbeker Chaussee
64 Wedel (AWL)53° 34′ 55″ N, 9° 42′ 18″ O Wedel
65 Wellingsbüttel (AWBS)53° 38′ 28″ N, 10° 4′ 57″ O Wellingsbüttel
66 Wilhelmsburg (AWFS)53° 29′ 53″ N, 10° 0′ 24″ O Wilhelmsburg
67 Wohltorf (AWLF)53° 31′ 14″ N, 10° 16′ 40″ O Wohltorf

68 rows × 2 columns

bhf_ubahn = re.compile(r"(^[-a-zA-Züäöß.ÜÖ,Ä]{0,30}\s?[-a-zA-Züäöß.ÜÖ,Ä]{0,30}[^(\d])")
dfu["name"] = dfu["station"].str.extract(bhf_ubahn, expand=False)
dfu.loc[:, ["station", "name"]]
station name
0 Ahrensburg Ost (AO)53° 39′ 41″ N, 10° 14′ 36″ O Ahrensburg Ost
1 Ahrensburg West (AW)53° 39′ 52″ N, 10° 13′ 12″ O Ahrensburg West
2 Alsterdorf (AL)53° 36′ 23″ N, 10° 0′ 42″ O Alsterdorf
3 Alter Teichweg (AT)53° 35′ 12″ N, 10° 3′ 52″ O Alter Teichweg
4 Barmbek (BA)53° 35′ 14″ N, 10° 2′ 40″ O Barmbek
... ... ...
96 Wandsbek Markt (WM)53° 34′ 19″ N, 10° 4′ 4″ O Wandsbek Markt
95 Wandsbek-Gartenstadt (WK)53° 35′ 32″ N, 10° 4′... Wandsbek-Gartenstadt
97 Wandsbeker Chaussee (WR)53° 34′ 12″ N, 10° 3′ ... Wandsbeker Chaussee
98 Wartenau (WA)53° 33′ 51″ N, 10° 2′ 4″ O Wartenau
92 Überseequartier (UR)53° 32′ 26″ N, 9° 59′ 56″ O Überseequartier

99 rows × 2 columns

Extract Entire Coordinate Pairs

Given that the coordinates in the input files are in DMS format, regular expressions are used to extract the entire coordinate pair for each station. Looking at the value range across all rows, the degree component of the latitude is always 53, so the pattern anchors on it. It matches and captures everything up to the final capital O (of which there is exactly one per row), which marks the end of a complete coordinate pair.

# five = re.compile("([4-6][1-6].+O)")
five = re.compile("(53.+O)")
dfu["gps"] = dfu["station"].str.extract(five, expand=False)
dfs["gps"] = dfs["station"].str.extract(five, expand=False)

Check For Missing Values

For the subway stations, there are no missing values in the GPS column, which means that the capture group extracted something for every row.

dfu[dfu.gps.isna()]
station name gps

For the suburban trains, two missing values are found, but they were not introduced by the regular expression failing to match the coordinates: the ‘station’ column itself contains no coordinate values for these rows, which is why rows 25 and 63 have missing values in the GPS column. This will be fixed in the next step.

dfs[dfs.gps.isna()]
station name gps
25 Hauptbahnhof (AHS)Lage Hauptbahnhof NaN
63 Wandsbeker Chaussee (AWCH)Lage Wandsbeker Chaussee NaN

Fill Missing Values

The missing values are for the stations ‘Hauptbahnhof’ and ‘Wandsbeker Chaussee’. We look for other stations at these locations that do have GPS values; their coordinates should be close to those of the missing stations.

No other entries, apart from the two rows with missing GPS coordinates, are found for these stations in the suburban train stations DataFrame. Next, we try to find rows in the subway stations DataFrame that have valid GPS coordinates for the two stations.

dfs[dfs["station"].str.contains(r"Haupt|Wandsbek")]
station name gps
25 Hauptbahnhof (AHS)Lage Hauptbahnhof NaN
63 Wandsbeker Chaussee (AWCH)Lage Wandsbeker Chaussee NaN

Valid values for both stations are found in the subway stations DataFrame. The row indexes are 34 for ‘Hauptbahnhof’ and 97 for ‘Wandsbeker Chaussee’; these are used to fill the missing values.

dfu[dfu["station"].str.contains("Haupt|Wandsbek")]
station name gps
34 Hauptbahnhof Nord (HX)53° 33′ 15″ N, 10° 0′ 25″ O Hauptbahnhof Nord 53° 33′ 15″ N, 10° 0′ 25″ O
35 Hauptbahnhof Süd (HB)53° 33′ 8″ N, 10° 0′ 30″ O Hauptbahnhof Süd 53° 33′ 8″ N, 10° 0′ 30″ O
96 Wandsbek Markt (WM)53° 34′ 19″ N, 10° 4′ 4″ O Wandsbek Markt 53° 34′ 19″ N, 10° 4′ 4″ O
95 Wandsbek-Gartenstadt (WK)53° 35′ 32″ N, 10° 4′... Wandsbek-Gartenstadt 53° 35′ 32″ N, 10° 4′ 27″ O
97 Wandsbeker Chaussee (WR)53° 34′ 12″ N, 10° 3′ ... Wandsbeker Chaussee 53° 34′ 12″ N, 10° 3′ 31″ O

Using the row indexes found in the subway stations DataFrame above, the missing values in the suburban train stations DataFrame are filled manually by overwriting their values in the gps column.

dfs.loc[25, "gps"] = dfu.loc[34, "gps"]
dfs.loc[63, "gps"] = dfu.loc[97, "gps"]

Splitting on ‘,’, we create two new columns from the values in the gps column. The new columns hold the latitude and longitude values, respectively, for each station.

dfu[["lat", "lng"]] = dfu["gps"].str.split(",", expand=True)
dfs[["lat", "lng"]] = dfs["gps"].str.split(",", expand=True)

The output below verifies, for the first five entries of the lat and lng columns, that each coordinate pair follows the (latitude, longitude) order.

dfu[['lat','lng']].head()
lat lng
0 53° 39′ 41″ N 10° 14′ 36″ O
1 53° 39′ 52″ N 10° 13′ 12″ O
2 53° 36′ 23″ N 10° 0′ 42″ O
3 53° 35′ 12″ N 10° 3′ 52″ O
4 53° 35′ 14″ N 10° 2′ 40″ O

All rows of the lat and lng columns should hold the correct halves of each pair: lat values should always start with the digits 53 and end with a capital N, while values in the lng column should start with 10 or 9 and always end with a capital O.

The actual verification is done once the values are converted to DD format, since verifying them then takes much less effort than in their current DMS form.
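Still, for illustration, a quick pre-conversion check could look like the following sketch. It is not part of the original pipeline and assumes every entry carries degree, minute, and second components, as in the samples above:

# Illustrative only: every lat value must read 53° m′ s″ N and every
# lng value 9° or 10° m′ s″ O (after stripping stray whitespace).
for df in (dfu, dfs):
    assert df["lat"].str.strip().str.match(r"^53° \d+′ \d+″ N$").all()
    assert df["lng"].str.strip().str.match(r"^(?:10|9)° \d+′ \d+″ O$").all()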

Before defining the function that converts the GPS coordinates from DMS to DD format, we look at one row of the latitude and longitude columns of both DataFrames. The structure of the values is identical in both, which means one conversion function should suffice; there is no need to define two.

def prev(df: pd.DataFrame):
    print(
        f"latitude component looks like: {df.lat[49]},\n\nlongitude component looks like {df.lng[49]}"
    )
prev(dfu)
latitude component looks like: 53° 39′ 39″ N,

longitude component looks like  10° 1′ 3″ O
prev(dfs)
latitude component looks like: 53° 39′ 7″ N,

longitude component looks like  10° 5′ 38″ O

To convert the GPS values from DMS to DD, we define two functions that perform the conversion. We test the output to make sure it is as expected.

def dms2dd(degrees, minutes, seconds, direction):
    # DD = degrees + minutes / 60 + seconds / 3600.
    dd = float(degrees) + float(minutes) / 60 + float(seconds) / (60 * 60)
    # Southern latitudes and western longitudes are negative in DD.
    if direction in ("S", "W"):
        dd *= -1
    return dd

def parse_dms(dms):
    # Split on every run of characters that is neither alphanumeric nor a
    # dot, yielding (degrees, minutes, seconds, direction).
    parts = re.split(r"[^\d\w\.]+", dms)
    return dms2dd(parts[0], parts[1], parts[2], parts[3])

dd = parse_dms("78°55'44.33324\" N")
print(dd)
78.92898145555556
print(dfs.lat[4])
print(f"{parse_dms(dfs.lat[4])}")
53° 31′ 48″ N
53.53

All rows in the lat and lng columns are stripped of any remaining whitespace. This is done before applying the parse_dms function, since the regex pattern might fail to match all relevant parts of the coordinates if there is unaccounted-for whitespace.

dfu["lng"] = dfu["lng"].str.strip()
dfu["lat"] = dfu["lat"].str.strip()
dfs["lng"] = dfs["lng"].str.strip()
dfs["lat"] = dfs["lat"].str.strip()

The custom conversion function parse_dms is applied to all rows of the lat and lng columns of both DataFrames, and the results are saved in new columns sharing the suffix _dd, short for decimal degrees. These new columns make it possible to compare the values before and after the conversion in each row of the DataFrames.

dfu["lat_dd"] = dfu["lat"].apply(lambda x: parse_dms(x))
dfu["lng_dd"] = dfu["lng"].apply(lambda x: parse_dms(x))
dfs["lat_dd"] = dfs["lat"].apply(lambda x: parse_dms(x))
dfs["lng_dd"] = dfs["lng"].apply(lambda x: parse_dms(x))

Final Verification Of The GPS Coordinates

In order to verify that the coordinates are valid for all rows of the lat_dd and lng_dd columns in both DataFrames, a custom function is used to check both DataFrames.

The function takes a pandas DataFrame as input and compiles two regular expression patterns at runtime. Both patterns are anchored to the start and end of each entry: they match the valid digits that can occur before the decimal separator and then any number of digits after it. The $ anchor makes sure that nothing but digits follows the separator until the end of each value.

def vf(df: pd.DataFrame):
    # Anchored patterns: the whole value must be digits around a single dot.
    lat_pat = re.compile(r"^53\.\d+$")
    lng_pat = re.compile(r"^1?[09]\.\d+$")
    for _, row in df[["lat_dd", "lng_dd"]].iterrows():
        m1, m2 = lat_pat.match(str(row["lat_dd"])), lng_pat.match(str(row["lng_dd"]))
        assert m1.end() == len(str(row["lat_dd"]))
        assert m2.end() == len(str(row["lng_dd"]))

Both DataFrames pass the verification for all rows. See below:

vf(dfu)
vf(dfs)

Creating The Final GPS Column

Using the assign function from the pandas library, a new column gps_dd is created. Its values are tuples of GPS coordinates for each subway and suburban train station found in the data. This step is needed in order to apply the Point conversion from the shapely.geometry module. The conversion makes it possible to integrate the features created here into the listings data in the core dataset and to compute distances between listings and any station found in the two datasets processed here. The ability to compute distances between features is a powerful tool that can, for example, enhance the results we get from interpreting a tree-based model in the later stages of this project.

dfs = dfs.assign(
    gps_dd=list(dfs[["lat_dd", "lng_dd"]].itertuples(index=False, name=None))
)
dfu = dfu.assign(
    gps_dd=list(dfu[["lat_dd", "lng_dd"]].itertuples(index=False, name=None))
)

Applying the Point conversion can throw errors that would require revisiting all the previous steps and possibly reapplying the transformations. The documentation of the shapely library has been my best source for answers to problems encountered when applying the Point conversion.

dfs["gps_dd"] = dfs["gps_dd"].apply(Point)
dfu["gps_dd"] = dfu["gps_dd"].apply(Point)

Final Glimpse At The Tables

This is how the tables look in their exported form. The intermediate columns are kept for better interpretability of the data, should any problems arise later down the line. Pandas stores the converted values as a column of shapely.geometry.Point objects.

def s(df, seed=42):
    return df.sample(n=8, random_state=seed)

s(dfu)
station name gps lat lng lat_dd lng_dd gps_dd
61 Mümmelmannsberg (MG)53° 31′ 41″ N, 10° 8′ 53″ O Mümmelmannsberg 53° 31′ 41″ N, 10° 8′ 53″ O 53° 31′ 41″ N 10° 8′ 53″ O 53.528056 10.148056 POINT (53.528055555555554 10.148055555555555)
40 Hudtwalckerstraße (HU)53° 35′ 40″ N, 9° 59′ 46″ O Hudtwalckerstraße 53° 35′ 40″ N, 9° 59′ 46″ O 53° 35′ 40″ N 9° 59′ 46″ O 53.594444 9.996111 POINT (53.59444444444445 9.99611111111111)
95 Wandsbek-Gartenstadt (WK)53° 35′ 32″ N, 10° 4′... Wandsbek-Gartenstadt 53° 35′ 32″ N, 10° 4′ 27″ O 53° 35′ 32″ N 10° 4′ 27″ O 53.592222 10.074167 POINT (53.592222222222226 10.074166666666667)
18 Emilienstraße (EM)53° 34′ 18″ N, 9° 57′ 9″ O Emilienstraße 53° 34′ 18″ N, 9° 57′ 9″ O 53° 34′ 18″ N 9° 57′ 9″ O 53.571667 9.952500 POINT (53.57166666666667 9.952499999999999)
98 Wartenau (WA)53° 33′ 51″ N, 10° 2′ 4″ O Wartenau 53° 33′ 51″ N, 10° 2′ 4″ O 53° 33′ 51″ N 10° 2′ 4″ O 53.564167 10.034444 POINT (53.564166666666665 10.034444444444444)
84 St. Pauli (PA)53° 33′ 2″ N, 9° 58′ 12″ O St. Pauli 53° 33′ 2″ N, 9° 58′ 12″ O 53° 33′ 2″ N 9° 58′ 12″ O 53.550556 9.970000 POINT (53.550555555555555 9.97)
64 Niendorf Nord (NN)53° 38′ 26″ N, 9° 57′ 1″ O Niendorf Nord 53° 38′ 26″ N, 9° 57′ 1″ O 53° 38′ 26″ N 9° 57′ 1″ O 53.640556 9.950278 POINT (53.64055555555556 9.950277777777776)
42 Jungfernstieg (JG)53° 33′ 8″ N, 9° 59′ 37″ O Jungfernstieg 53° 33′ 8″ N, 9° 59′ 37″ O 53° 33′ 8″ N 9° 59′ 37″ O 53.552222 9.993611 POINT (53.55222222222222 9.993611111111111)
s(dfs)
station name gps lat lng lat_dd lng_dd gps_dd
46 Ohlsdorf (AOPS)53° 37′ 14″ N, 10° 1′ 54″ O Ohlsdorf 53° 37′ 14″ N, 10° 1′ 54″ O 53° 37′ 14″ N 10° 1′ 54″ O 53.620556 10.031667 POINT (53.620555555555555 10.031666666666668)
15 Elbgaustraße (AEGS)53° 36′ 10″ N, 9° 53′ 35″ O Elbgaustraße 53° 36′ 10″ N, 9° 53′ 35″ O 53° 36′ 10″ N 9° 53′ 35″ O 53.602778 9.893056 POINT (53.60277777777778 9.893055555555556)
4 Aumühle (AAMS)53° 31′ 48″ N, 10° 18′ 53″ O Aumühle 53° 31′ 48″ N, 10° 18′ 53″ O 53° 31′ 48″ N 10° 18′ 53″ O 53.530000 10.314722 POINT (53.53 10.314722222222223)
9 Billwerder-Moorfleet (ABWM)53° 30′ 57″ N, 10° ... Billwerder-Moorfleet 53° 30′ 57″ N, 10° 5′ 47″ O 53° 30′ 57″ N 10° 5′ 47″ O 53.515833 10.096389 POINT (53.51583333333333 10.096388888888889)
28 Hoheneichen (AHCH)53° 38′ 8″ N, 10° 4′ 5″ O Hoheneichen 53° 38′ 8″ N, 10° 4′ 5″ O 53° 38′ 8″ N 10° 4′ 5″ O 53.635556 10.068056 POINT (53.635555555555555 10.068055555555555)
41 Nettelnburg (ANTB)53° 29′ 16″ N, 10° 10′ 52″ O Nettelnburg 53° 29′ 16″ N, 10° 10′ 52″ O 53° 29′ 16″ N 10° 10′ 52″ O 53.487778 10.181111 POINT (53.48777777777778 10.181111111111111)
58 Sternschanze (ASST)53° 33′ 49″ N, 9° 57′ 57″ O Sternschanze 53° 33′ 49″ N, 9° 57′ 57″ O 53° 33′ 49″ N 9° 57′ 57″ O 53.563611 9.965833 POINT (53.56361111111111 9.965833333333332)
5 Bahrenfeld (ABAF)53° 33′ 36″ N, 9° 54′ 38″ O Bahrenfeld 53° 33′ 36″ N, 9° 54′ 38″ O 53° 33′ 36″ N 9° 54′ 38″ O 53.560000 9.910556 POINT (53.559999999999995 9.910555555555556)

Export The Cleaned Station Data

The two tables are ready to be integrated into the main dataset and are therefore exported as csv files.

dfs.to_csv("../data/s-bahn_final.csv")
dfu.to_csv("../data/u-bahn_final.csv")

These are the steps for creating new geospatial features that are completely independent of the ones in the core dataset scraped from www.immobilienscout24.de. In the following steps, these features will be integrated into the core dataset and used to create new features for each listing found in the core dataset.
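As a preview of that integration, the following is a minimal sketch of a nearest-station feature. It assumes a hypothetical listing coordinate and reuses the haversine sketch from above:

# Hypothetical listing coordinate (decimal degrees), for illustration only.
listing_lat, listing_lng = 53.5511, 9.9937

# Distance from the listing to every subway station; .x/.y follow this
# column's (lat, lng) ordering.
dists = dfu["gps_dd"].apply(lambda p: haversine(listing_lat, listing_lng, p.x, p.y))
print(dfu.loc[dists.idxmin(), "name"], round(dists.min(), 2), "km")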