Mastery in Pandas: In-Depth Data Exploration, Part 1

This article showcases my expertise in using pandas for advanced data exploration. It focuses on analyzing a 47-column dataset, providing insights into leveraging pandas for complex tabular data management.

Mastery in Pandas: In-Depth Data Exploration, Part 1

Summary Of The Series

This series demonstrates my deep expertise in pandas and pyjanitor for advanced data exploration and cleaning. In Part 1, “Mastery in Pandas: In-Depth Data Exploration, Part 1,” a 47-column dataset is analyzed to showcase complex tabular data management. Dimensionality reduction techniques are applied to retain only relevant columns. Part 2, “PyJanitor Proficiency: Efficient String Data Cleaning, Part 2,” leverages pyjanitor’s method chaining syntax to enhance the speed and efficiency of string data cleaning. The focus is on unique column values to guide the cleaning steps.

Regular expressions (regex) are extensively used in Part 4, “Advanced Data Cleaning and Validation: Batch Processing with Pandas, Part 4,” for extracting and validating cell contents, replacing invalid patterns with valid ones. Additionally, this part emphasizes handling large volumes of tabular data through batch processing. Part 3, “Geospatial Engineering in Pandas: Creating Valid GPS Columns, Part 3,” highlights the creation and validation of Point geometry gps columns by merging longitude and latitude fields into GPS data columns. The series culminates with the creation of new timedelta64[ns] time_listed and gps columns, illustrating advanced data cleaning and validation techniques.


Mastery in Pandas: In-Depth Data Exploration, Part 1
PyJanitor Proficiency: Efficient String Data Cleaning, Part 2
Geospatial Engineering in Pandas: Creating Valid GPS Columns, Part 3
Advanced Data Cleaning and Validation: Batch Processing with Pandas, Part 4


Reading In The Input Data

The input data, crucial for analyzing rental listings in Hamburg, is divided into three CSV files. These files collectively encompass all rental listings available on ‘ImmosScout24’, a prominent German rental listing website, within the city boundaries at the time of scraping. The official website can be found at ImmoScout24.

The primary goal at this stage is to clean and prepare a tidy DataFrame, setting the foundation for subsequent stages such as Feature Engineering, Exploratory Data Analysis, Machine Learning, and ultimately presenting the solution to stakeholders.

Preliminary Steps

First, we import the necessary Python libraries and set a seed for reproducibility:

import pandas as pd  # Library for DataFrame manipulation
seed = 42  # Set seed for reproducible random outputs

Data Preparation

The paths to the scraping data are defined, and DataFrames are created using a custom function read_and_prepare_data. This function enhances readability, ensures consistent data type handling, and follows the principle of DRY (Don’t Repeat Yourself). The low_memory=False option is used in pd.read_csv() to avoid dtype guessing, ensuring more accurate data types, particularly for columns with mixed types.

The DataFrames are then concatenated, and duplicate rows are removed. This approach is more efficient and cleaner than appending DataFrames.

# File paths for the scraping data
scraping_1 = "../data/20181203-first_scraping_topage187.csv"
scraping_2 = "../data/20181203-second_scraping_topage340.csv"
scraping_3 = "../data/20181203-third_scraping_topage340.csv"

def read_and_prepare_data(file_path: str) -> pd.DataFrame:
    """
    Reads and processes CSV data from a given file path.
    Uses low_memory=False for accurate data type detection.
    """
    return pd.read_csv(file_path, index_col=False, parse_dates=False, low_memory=False)

# Reading and processing the data
df1 = read_and_prepare_data(scraping_1)
df2 = read_and_prepare_data(scraping_2)
df3 = read_and_prepare_data(scraping_3)

# Concatenating and cleaning dataframes
df = pd.concat([df1, df2, df3], ignore_index=True).drop(columns=["Unnamed: 0"]).drop_duplicates()

Summary

The updated approach adheres to best practices in Python programming, ensuring efficiency, readability, and maintainability of the code. The initial data preprocessing sets a strong foundation for the upcoming data analysis and machine learning stages.

First Look At The DataFrame

To get a first look at the newly created DataFrame df, one can choose between multiple tools in the pandas library. It is assumed that the Dataframe is named df in the following, as a couple of the tools, the pandas library has to offer, are described and links to the documentation page of each command are added for more detail on how each command works.

Commands

df.head()

Description

The command df.head()returns the first 5 rows of the DataFrame by default, if no parameters are specified by the user. Using the parameter n, one can specify the number of rows that get returned. Needless to say, rows returned will always start at the first index value and include the following n-1 rows.

Example

In the first call to df.head(), the default value for number of lines printed (n=5) is used by not specifying any parameter value in the function call. In the second call, the number of lines printed is changed to n =9.

df.head() # Includes index values [0:4], which is default (n=5)
einbau_kue lift nebenkosten gesamt_miete heiz_kosten lat lng str nutzf regio ... json_firingTypes json_hasKitchen json_cellar json_yearConstructedRange json_baseRent json_livingSpace json_condition json_interiorQual json_petsAllowed json_lift
0 Einbauküche NaN 190,84 541,06 inkl. 78 € ['lat: 53.52943934146104,'] ['lng: 10.152357145948395'] Strietkoppel 20 NaN 22115 Hamburg Billstedt ... ['"obj_firingTypes":"district_heating"'] ['"obj_hasKitchen":"y"'] ['"obj_cellar":"n"'] ['"obj_yearConstructedRange":"3"'] ['"obj_baseRent":"350.22"'] ['"obj_livingSpace":"76"'] ['"obj_condition":"no_information"'] ['"obj_interiorQual":"no_information"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']
1 NaN NaN 117,95 559,05 inkl. 52,07 € ['lat: 53.58414266878421,'] ['lng: 10.06842724545814'] Naumannplatz 2 NaN 22049 Hamburg Dulsberg ... ['"obj_firingTypes":"district_heating"'] ['"obj_hasKitchen":"n"'] ['"obj_cellar":"n"'] ['"obj_yearConstructedRange":"1"'] ['"obj_baseRent":"441.1"'] ['"obj_livingSpace":"60"'] ['"obj_condition":"no_information"'] ['"obj_interiorQual":"no_information"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']
2 NaN NaN 249,83 839,01 inkl. 110,58 € ['lat: 53.6044918821393,'] ['lng: 9.86423115803761'] Warthestr. 52a NaN 22547 Hamburg Lurup ... ['"obj_firingTypes":"district_heating"'] ['"obj_hasKitchen":"n"'] ['"obj_cellar":"n"'] ['"obj_yearConstructedRange":"5"'] ['"obj_baseRent":"589.18"'] ['"obj_livingSpace":"75"'] ['"obj_condition":"no_information"'] ['"obj_interiorQual":"no_information"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']
3 Einbauküche NaN 70 665 (zzgl. Heizkosten) nicht in Nebenkosten enthalten ['lat: 53.56394970119381,'] ['lng: 9.956881419437474'] Oelkersallee 53 NaN 22769 Hamburg Altona-Nord ... ['"obj_firingTypes":"no_information"'] ['"obj_hasKitchen":"y"'] ['"obj_cellar":"n"'] ['"obj_yearConstructedRange":"2"'] ['"obj_baseRent":"595"'] ['"obj_livingSpace":"46"'] ['"obj_condition":"well_kept"'] ['"obj_interiorQual":"no_information"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']
4 Einbauküche NaN 213,33 651,81 inkl. 57,78 € ['lat: 53.60180649336605,'] ['lng: 10.081257248271337'] Haldesdorfer Str. 119a NaN 22179 Hamburg Bramfeld ... ['"obj_firingTypes":"district_heating"'] ['"obj_hasKitchen":"y"'] ['"obj_cellar":"n"'] ['"obj_yearConstructedRange":"6"'] ['"obj_baseRent":"438.48"'] ['"obj_livingSpace":"52"'] ['"obj_condition":"no_information"'] ['"obj_interiorQual":"no_information"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']

5 rows × 47 columns

df.head(n=3) # Includes index values [0:2]
einbau_kue lift nebenkosten gesamt_miete heiz_kosten lat lng str nutzf regio ... json_firingTypes json_hasKitchen json_cellar json_yearConstructedRange json_baseRent json_livingSpace json_condition json_interiorQual json_petsAllowed json_lift
0 Einbauküche NaN 190,84 541,06 inkl. 78 € ['lat: 53.52943934146104,'] ['lng: 10.152357145948395'] Strietkoppel 20 NaN 22115 Hamburg Billstedt ... ['"obj_firingTypes":"district_heating"'] ['"obj_hasKitchen":"y"'] ['"obj_cellar":"n"'] ['"obj_yearConstructedRange":"3"'] ['"obj_baseRent":"350.22"'] ['"obj_livingSpace":"76"'] ['"obj_condition":"no_information"'] ['"obj_interiorQual":"no_information"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']
1 NaN NaN 117,95 559,05 inkl. 52,07 € ['lat: 53.58414266878421,'] ['lng: 10.06842724545814'] Naumannplatz 2 NaN 22049 Hamburg Dulsberg ... ['"obj_firingTypes":"district_heating"'] ['"obj_hasKitchen":"n"'] ['"obj_cellar":"n"'] ['"obj_yearConstructedRange":"1"'] ['"obj_baseRent":"441.1"'] ['"obj_livingSpace":"60"'] ['"obj_condition":"no_information"'] ['"obj_interiorQual":"no_information"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']
2 NaN NaN 249,83 839,01 inkl. 110,58 € ['lat: 53.6044918821393,'] ['lng: 9.86423115803761'] Warthestr. 52a NaN 22547 Hamburg Lurup ... ['"obj_firingTypes":"district_heating"'] ['"obj_hasKitchen":"n"'] ['"obj_cellar":"n"'] ['"obj_yearConstructedRange":"5"'] ['"obj_baseRent":"589.18"'] ['"obj_livingSpace":"75"'] ['"obj_condition":"no_information"'] ['"obj_interiorQual":"no_information"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']

3 rows × 47 columns

df.tail()

Description

The command df.tail() is the counterpart to df.head(), it returns the last 5 rows of the DataFrame by default, if no parameters are specified by the user. Using the parameter n, one can specify the number of rows that get returned. Needless to say, rows returned will always end with the row at the last index value and include the preceding n-1 rows.

Example

First the maximum of the index of df is checked, to show that the last printed row is indeed the last value in the index of the DataFrame, other than that the examples mirror the two from the df.head() command, to display their similarities.

print('The maximum value of the range index of df is %s' % df.index.max())
df.tail()
The maximum value of the range index of df is 12494
einbau_kue lift nebenkosten gesamt_miete heiz_kosten lat lng str nutzf regio ... json_firingTypes json_hasKitchen json_cellar json_yearConstructedRange json_baseRent json_livingSpace json_condition json_interiorQual json_petsAllowed json_lift
12490 Einbauküche NaN 80 1.058 47 € ['lat: 53.586938610218276,'] ['lng: 10.016258235901141'] Goldbekufer 29 8 m² 22303 Hamburg Winterhude ... ['"obj_firingTypes":"oil"'] ['"obj_hasKitchen":"y"'] ['"obj_cellar":"y"'] ['"obj_yearConstructedRange":"2"'] ['"obj_baseRent":"931"'] ['"obj_livingSpace":"66.5"'] ['"obj_condition":"mint_condition"'] ['"obj_interiorQual":"sophisticated"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']
12491 Einbauküche NaN 61 674 26 € ['lat: 53.55758333359278,'] ['lng: 10.03986901076397'] Burgstraße 34 NaN 20535 Hamburg Hamm-Nord ... ['"obj_firingTypes":"oil"'] ['"obj_hasKitchen":"y"'] ['"obj_cellar":"y"'] ['"obj_yearConstructedRange":"2"'] ['"obj_baseRent":"587"'] ['"obj_livingSpace":"51"'] ['"obj_condition":"refurbished"'] ['"obj_interiorQual":"sophisticated"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']
12492 Einbauküche NaN 76 752 70 € ['lat: 53.6486450531966,'] ['lng: 10.039966464612842'] Bei der Ziegelei 4 8 m² 22339 Hamburg Hummelsbüttel ... ['"obj_firingTypes":"oil"'] ['"obj_hasKitchen":"y"'] ['"obj_cellar":"y"'] ['"obj_yearConstructedRange":"2"'] ['"obj_baseRent":"606"'] ['"obj_livingSpace":"63.69"'] ['"obj_condition":"refurbished"'] ['"obj_interiorQual":"sophisticated"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']
12493 Einbauküche Personenaufzug 59,93 382,73 21,62 € ['lat: 53.54886472789119,'] ['lng: 10.079737639604678'] Culinstraße 58 NaN 22111 Hamburg Horn ... ['"obj_firingTypes":"district_heating"'] ['"obj_hasKitchen":"y"'] ['"obj_cellar":"n"'] ['"obj_yearConstructedRange":"2"'] ['"obj_baseRent":"301.18"'] ['"obj_livingSpace":"30.89"'] ['"obj_condition":"no_information"'] ['"obj_interiorQual":"no_information"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"y"']
12494 NaN NaN 101 499,91 in Nebenkosten enthalten ['lat: 53.46145736469006,'] ['lng: 9.966232033537533'] Denickestraße 42 b NaN 21075 Hamburg Harburg ... [] ['"obj_hasKitchen":"n"'] ['"obj_cellar":"n"'] ['"obj_yearConstructedRange":"2"'] ['"obj_baseRent":"398.91"'] ['"obj_livingSpace":"46.93"'] ['"obj_condition":"well_kept"'] ['"obj_interiorQual":"no_information"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']

5 rows × 47 columns

df.columns

Description

The command df.columns does one thing and one thing well, one might say. It returns a list of strings, the list of the columns of the DataFrame. Its output can be iterated through, in order to select subsets of all columns. An iteration can be done in the form of a list comprehension that makes use of conditional clauses for example. It also helps one find problematic column names that need to be changed in order to qualify as tidy. The output of it also helps one get an overview of all the columns names and therefore is a starting point for dropping certain columns and renaming the columns, so they follow an easy to remember and precise naming pattern.

Example

Below, the set of columns of the DataFrame are printed. One can see that there are two types of patterns found in the names of the columns.

  1. The first set of columns originates from values found in listings that are visible to the visitor. These ones have no prefix attached to them and they all have German names.
  2. Columns in the second set have the prefix ‘json_’ added to them. This comes from the fact that they were sourced from a script tag found in the raw HTML code of each listing. The inner HTML of these script tags consisted of key-value pairs using a json like formatting. It was not machine readable though. The names of these columns were in English already and only the ‘json_’ prefix was added afterwards.

There are several other differences between the sets, as we will see later.

df.columns
Index(['einbau_kue', 'lift', 'nebenkosten', 'gesamt_miete', 'heiz_kosten',
       'lat', 'lng', 'str', 'nutzf', 'regio', 'parking', 'online_since',
       'baujahr', 'objekt_zustand', 'heizungsart', 'wesent_energietr',
       'endenergiebedarf', 'kaltmiete', 'quadratmeter', 'anzahl_zimmer',
       'balkon/terasse', 'keller', 'typ', 'etage', 'anz_schlafzimmer',
       'anz_badezimmer', 'haustiere', 'nicht_mehr_verfg_seit',
       'json_heatingType', 'json_balcony', 'json_electricityBasePrice',
       'json_picturecount', 'json_telekomDownloadSpeed',
       'json_telekomUploadSpeed', 'json_totalRent', 'json_yearConstructed',
       'json_electricityKwhPrice', 'json_firingTypes', 'json_hasKitchen',
       'json_cellar', 'json_yearConstructedRange', 'json_baseRent',
       'json_livingSpace', 'json_condition', 'json_interiorQual',
       'json_petsAllowed', 'json_lift'],
      dtype='object')

df.index

Description

The command df.index prints the type of the index of the DataFrame, as well as a couple of index values from the beginning and end of the 64bit integer index range in our example. When the final DataFrame df was created, using the following line of code:

df = df1.append([df2, df3], ignore_index=True)

The ignore_index=True part was important to make sure that the range indexes of each of the appended DataFrames df2 and df3 would not simply get stacked on top of the index of df1. Would that have happened the resulting index would have been unusable, since there would not have been a monotonously increasing range index in the resulting DataFrame.

Example

In the example it is shown what the resulting index of the final DataFrame would have been, if parameter * ignore_index* would not have been specified at all (df_index_1) and what it would have been, given ignore_index=False (df_index_2). The resulting index is the same in both cases and it is important that one knows exactly how the index of any DataFrame looks like, in order to be able to manipulate and clean it. The resulting range index of the DataFrame, given ignore_index=True is used in the input statement shows all the qualities a simple range index should have.

df_index_1 = df1.append([df2, df3])
df_index_2 = df1.append([df2, df3], ignore_index=False)
print('The resulting index, if no value is specified:\n %s\n ' % df_index_1.index)
print('The resulting index, if False is used:\n %s\n ' % df_index_2.index)
The resulting index, if no value is specified:
 Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            5582, 5583, 5584, 5585, 5586, 5587, 5588, 5589, 5590, 5591],
           dtype='int64', length=12495)

The resulting index, if False is used:
 Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            5582, 5583, 5584, 5585, 5586, 5587, 5588, 5589, 5590, 5591],
           dtype='int64', length=12495)
print('The resulting index, if True is used:\n %s\n ' % df.index)
The resulting index, if True is used:
 Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            12485, 12486, 12487, 12488, 12489, 12490, 12491, 12492, 12493,
            12494],
           dtype='int64', length=12325)

df.describe()

Description

The command df.describe() gives summary statistics for all columns that are of a numerical data type (dtype) by default. In the default case, the following statistics are included in the output for each included column. The following notation will be used from this point onwards: \(np.nan\) stands for missing values and \(\neg np.nan\) stands for non missing values.

Example

The data types (dtypes) in the DataFrame are checked, before df.describe() is explored for df.

df.dtypes
einbau_kue                    object
lift                          object
nebenkosten                   object
gesamt_miete                  object
heiz_kosten                   object
lat                           object
lng                           object
str                           object
nutzf                         object
regio                         object
parking                       object
online_since                  object
baujahr                       object
objekt_zustand                object
heizungsart                   object
wesent_energietr              object
endenergiebedarf              object
kaltmiete                     object
quadratmeter                  object
anzahl_zimmer                 object
balkon/terasse                object
keller                        object
typ                           object
etage                         object
anz_schlafzimmer             float64
anz_badezimmer               float64
haustiere                     object
nicht_mehr_verfg_seit         object
json_heatingType              object
json_balcony                  object
json_electricityBasePrice     object
json_picturecount             object
json_telekomDownloadSpeed     object
json_telekomUploadSpeed       object
json_totalRent                object
json_yearConstructed          object
json_electricityKwhPrice      object
json_firingTypes              object
json_hasKitchen               object
json_cellar                   object
json_yearConstructedRange     object
json_baseRent                 object
json_livingSpace              object
json_condition                object
json_interiorQual             object
json_petsAllowed              object
json_lift                     object
dtype: object

From the output one can see that there only 2 columns that exclusively hold numerical data and thus have a numerical * data type* (dtype). All other columns have mixed dtypes, so pandas labels them as having dtype ‘object’. In the following, all columns will be checked and their dtypes might change in the process of cleaning them.

With the information that only 2 columns have a numerical dtype, calling df.describe() with no further parameters specified, will print the summary statistics listed above only for those two columns. See the output below, for more details.

df.describe()
anz_schlafzimmer anz_badezimmer
count 6469.000000 7317.000000
mean 1.579379 1.096351
std 0.746926 0.321261
min 0.000000 0.000000
25% 1.000000 1.000000
50% 1.000000 1.000000
75% 2.000000 1.000000
max 8.000000 11.000000

We will use the summary statistics for variable anz_schlafzimmer as an example of how one can interpret their values for a given data series. The variable gives the number of bedrooms that a listing has, according to the data found in the listing on the website.

The distributions will be analyzed at a later stage, for now the focus is on getting a ‘first look’ at the DataFrame.

Below one finds the value counts for variable anz_schlafzimmer, which describes the number of bedrooms found in each listing.

df['anz_schlafzimmer'].value_counts()
1.0    3544
2.0    2207
3.0     594
4.0      97
5.0      15
0.0      10
8.0       1
6.0       1
Name: anz_schlafzimmer, dtype: int64

df.shape

Description

The command returns a tuple object which has two numerical values. Let $(x,y)$ be the output of df.shape, a tuple object where \(x\) gives the number of rows df has, while \(y\) gives the number of columns of it.

Example

See below for the created df.

df.shape
(12325, 47)

df.count()

Description

df.count() returns the count of all \(\neg np.nan\) for each column or for a subset.

Example

In the example, the output was shortened by only including 4 randomly selected columns out of the 47 columns in the df.

df.count().sample(4,random_state=seed)
nicht_mehr_verfg_seit    11629
json_cellar              12324
haustiere                 3914
json_condition           12324
dtype: int64

df.nunique()

Description

Returns an integer value that gives the number of unique values in the data frame or of a subset of columns in the df. It does not return the unique values themselves.

Example

The example shows how it can be applied to df and what the output looks like. A subset of the columns is used again, to keep the output readable.

df.nunique().sample(4,random_state=seed)
nicht_mehr_verfg_seit    701
json_cellar                2
haustiere                  3
json_condition            10
dtype: int64

df.filter()

Description

df.filter() can be used like df.loc, if parameter items is added. It prints the columns specified as a list of column names. However, where it shines is when there are subsets of columns that have a certain pattern in their names. In this case, one can use parameter regex, followed by a regex pattern along with the parameter and value axis=1.

Example

In the first example df.filter() is used like df.loc to select a subset of two columns. The second example shows how regex can be used to filter certain columns. As mentioned earlier, in the DataFrame constructed there is a subset of columns, whose names all begin with the prefix ‘json_’. Using regex, makes it easy to filter out these columns.

Example 1 - Using df.filter() to select a subset of columns.

df.filter(items=['lift','str']).sample(4,random_state=seed)
lift str
9939 NaN NaN
1217 NaN Alter Güterbahnhof 10a
6023 NaN Jütlandring 48
12239 NaN Estebogen 22

Example 2 - Using df.filter() to select all columns that have the prefix ‘json_’ in their names.

df.filter(regex='^json', axis=1).sample(4,random_state=seed)
json_heatingType json_balcony json_electricityBasePrice json_picturecount json_telekomDownloadSpeed json_telekomUploadSpeed json_totalRent json_yearConstructed json_electricityKwhPrice json_firingTypes json_hasKitchen json_cellar json_yearConstructedRange json_baseRent json_livingSpace json_condition json_interiorQual json_petsAllowed json_lift
9939 [] ['"obj_balcony":"y"'] ['"obj_electricityBasePrice":"90.76"'] ['"obj_picturecount":"12"'] ['"obj_telekomDownloadSpeed":"16 MBit/s"'] ['"obj_telekomUploadSpeed":"2,4 MBit/s"'] ['"obj_totalRent":"1000"'] ['"obj_yearConstructed":"1983"'] ['"obj_electricityKwhPrice":"0.1985"'] ['"obj_firingTypes":"no_information"'] ['"obj_hasKitchen":"n"'] ['"obj_cellar":"n"'] ['"obj_yearConstructedRange":"4"'] ['"obj_baseRent":"750"'] ['"obj_livingSpace":"87"'] ['"obj_condition":"no_information"'] ['"obj_interiorQual":"no_information"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']
1217 [] ['"obj_balcony":"y"'] ['"obj_electricityBasePrice":"90.76"'] ['"obj_picturecount":"5"'] [] [] ['"obj_totalRent":"680.23"'] ['"obj_yearConstructed":"2015"'] ['"obj_electricityKwhPrice":"0.1985"'] ['"obj_firingTypes":"no_information"'] ['"obj_hasKitchen":"y"'] ['"obj_cellar":"n"'] ['"obj_yearConstructedRange":"8"'] ['"obj_baseRent":"441.03"'] ['"obj_livingSpace":"75"'] ['"obj_condition":"no_information"'] ['"obj_interiorQual":"no_information"'] ['"obj_petsAllowed":"no_information"'] ['"obj_lift":"n"']
6023 ['"obj_heatingType":"central_heating"'] ['"obj_balcony":"n"'] ['"obj_electricityBasePrice":"90.76"'] ['"obj_picturecount":"9"'] [] [] ['"obj_totalRent":"1188"'] ['"obj_yearConstructed":"1903"'] ['"obj_electricityKwhPrice":"0.1985"'] ['"obj_firingTypes":"no_information"'] ['"obj_hasKitchen":"y"'] ['"obj_cellar":"y"'] ['"obj_yearConstructedRange":"1"'] ['"obj_baseRent":"938"'] ['"obj_livingSpace":"67"'] ['"obj_condition":"well_kept"'] ['"obj_interiorQual":"sophisticated"'] ['"obj_petsAllowed":"no"'] ['"obj_lift":"n"']
12239 ['"obj_heatingType":"central_heating"'] ['"obj_balcony":"n"'] ['"obj_electricityBasePrice":"90.76"'] ['"obj_picturecount":"8"'] ['"obj_telekomDownloadSpeed":"100 MBit/s"'] ['"obj_telekomUploadSpeed":"40 MBit/s"'] ['"obj_totalRent":"685"'] ['"obj_yearConstructed":"1970"'] ['"obj_electricityKwhPrice":"0.1985"'] ['"obj_firingTypes":"oil"'] ['"obj_hasKitchen":"n"'] ['"obj_cellar":"y"'] ['"obj_yearConstructedRange":"2"'] ['"obj_baseRent":"485"'] ['"obj_livingSpace":"65"'] ['"obj_condition":"well_kept"'] ['"obj_interiorQual":"normal"'] ['"obj_petsAllowed":"negotiable"'] ['"obj_lift":"n"']

df.sample()

Description

Allows one to randomly sample from a DataFrame or pandas.Series. It was used several times in the examples so far, in order to give a better glimpse of the data in the df. Alternatives would have been, among others, df.head() and df.tail(). The main reason df.sample() was preferred over these alternatives is the reason that by using df.sample() one gets a subset of rows or columns of the data frame that are not constricted to either being at the very beginning of the index in the case of df.head() or at the very end of the index, if df.tail() is used. The subset that df.sample() produces might not have anything over the ones produced by df.head() or df.tail(), since it is a random sample after all. It is advised to specify a value for a seed that is used used whenever any kind of random element is part of command. In the case of df.sample(), one can pass a random seed in several different ways. Here, only a integer value was needed (seed = 42), as defined along the imports needed for this article. Parameter random_state takes the value of the random seed (random_state=seed). Specifying a seed has the benefit to make the output consistent and most importantly reproducible when run several times by one self or by anyone else executing the file again.

This marks the end of this article. Steps from how to create a new DataFrame from several smaller DataFrames were covered, before various tools in the pandas library were showcased by describing each one along with using examples to show they can be used in projects.




Mastery in Pandas: In-Depth Data Exploration, Part 1
PyJanitor Proficiency: Efficient String Data Cleaning, Part 2
Geospatial Engineering in Pandas: Creating Valid GPS Columns, Part 3
Advanced Data Cleaning and Validation: Batch Processing with Pandas, Part 4