This series demonstrates my expertise in pandas and pyjanitor for advanced data exploration and cleaning. In Part 1, “Mastery in Pandas: In-Depth Data Exploration, Part 1,” a 47-column dataset is analyzed to showcase complex tabular data management, and dimensionality reduction techniques are applied to retain only the relevant columns. Part 2, “PyJanitor Proficiency: Efficient String Data Cleaning, Part 2,” leverages pyjanitor’s method chaining syntax to enhance the speed and efficiency of string data cleaning. The focus is on unique column values to guide the cleaning steps.
Regular expressions (regex) are extensively used in Part 4, “Advanced Data Cleaning and Validation: Batch Processing with Pandas, Part 4,” for extracting and validating cell contents, replacing invalid patterns with valid ones. Additionally, this part emphasizes handling large volumes of tabular data through batch processing. Part 3, “Geospatial Engineering in Pandas: Creating Valid GPS Columns, Part 3,” highlights the creation and validation of Point geometry gps columns by merging longitude and latitude fields into GPS data columns. The series culminates with the creation of new timedelta64[ns] time_listed and gps columns, illustrating advanced data cleaning and validation techniques.
Mastery in Pandas: In-Depth Data Exploration, Part 1
PyJanitor Proficiency: Efficient String Data Cleaning, Part 2
Geospatial Engineering in Pandas: Creating Valid GPS Columns, Part 3
Advanced Data Cleaning and Validation: Batch Processing with Pandas, Part 4
The input data, crucial for analyzing rental listings in Hamburg, is divided into three CSV files. Together, these files encompass all rental listings on ‘ImmoScout24’, a prominent German rental listing website, that lay within the city boundaries at the time of scraping. The official website can be found at ImmoScout24.
The primary goal at this stage is to clean and prepare a tidy DataFrame, setting the foundation for subsequent stages such as Feature Engineering, Exploratory Data Analysis, Machine Learning, and ultimately presenting the solution to stakeholders.
First, we import the necessary Python libraries and set a seed for reproducibility:
import pandas as pd # Library for DataFrame manipulation
seed = 42 # Set seed for reproducible random outputs
The paths to the scraping data are defined, and DataFrames are created using a custom function read_and_prepare_data
. This function enhances readability, ensures consistent data type handling, and follows the principle of DRY (Don’t Repeat Yourself). The low_memory=False
option is used in pd.read_csv()
to avoid dtype guessing, ensuring more accurate data types, particularly for columns with mixed types.
The DataFrames are then concatenated, and duplicate rows are removed. This approach is more efficient and cleaner than appending DataFrames.
# File paths for the scraping data
scraping_1 = "../data/20181203-first_scraping_topage187.csv"
scraping_2 = "../data/20181203-second_scraping_topage340.csv"
scraping_3 = "../data/20181203-third_scraping_topage340.csv"
def read_and_prepare_data(file_path: str) -> pd.DataFrame:
"""
Reads and processes CSV data from a given file path.
Uses low_memory=False for accurate data type detection.
"""
return pd.read_csv(file_path, index_col=False, parse_dates=False, low_memory=False)
# Reading and processing the data
df1 = read_and_prepare_data(scraping_1)
df2 = read_and_prepare_data(scraping_2)
df3 = read_and_prepare_data(scraping_3)
# Concatenating and cleaning dataframes
df = (
    pd.concat([df1, df2, df3], ignore_index=True)
    .drop(columns=["Unnamed: 0"])
    .drop_duplicates()
)
This approach adheres to best practices in Python programming, ensuring efficient, readable, and maintainable code. The initial data preprocessing sets a strong foundation for the upcoming data analysis and machine learning stages.
To get a first look at the newly created DataFrame df, one can choose between multiple tools in the pandas library. In the following, a selection of these tools is described, with the DataFrame assumed to be named df throughout; links to the documentation page of each command are added for more detail on how each command works.
df.head()
df.tail()
df.columns
df.index
df.describe()
df.shape
df.count()
df.nunique()
df.value_counts()
df.filter()
df.sample()
df.head()
The command df.head() returns the first 5 rows of the DataFrame by default, if no parameters are specified by the user. Using the parameter n, one can specify the number of rows that get returned. Rows returned will always start at the first index value and include the following n-1 rows.
In the first call to df.head(), the default number of printed rows (n=5) is used by not specifying any parameter value in the function call. In the second call, the number of printed rows is changed to n=3.
df.head() # Includes index values [0:4], which is default (n=5)
einbau_kue | lift | nebenkosten | gesamt_miete | heiz_kosten | lat | lng | str | nutzf | regio | ... | json_firingTypes | json_hasKitchen | json_cellar | json_yearConstructedRange | json_baseRent | json_livingSpace | json_condition | json_interiorQual | json_petsAllowed | json_lift | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Einbauküche | NaN | 190,84 | 541,06 | inkl. 78 € | ['lat: 53.52943934146104,'] | ['lng: 10.152357145948395'] | Strietkoppel 20 | NaN | 22115 Hamburg Billstedt | ... | ['"obj_firingTypes":"district_heating"'] | ['"obj_hasKitchen":"y"'] | ['"obj_cellar":"n"'] | ['"obj_yearConstructedRange":"3"'] | ['"obj_baseRent":"350.22"'] | ['"obj_livingSpace":"76"'] | ['"obj_condition":"no_information"'] | ['"obj_interiorQual":"no_information"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
1 | NaN | NaN | 117,95 | 559,05 | inkl. 52,07 € | ['lat: 53.58414266878421,'] | ['lng: 10.06842724545814'] | Naumannplatz 2 | NaN | 22049 Hamburg Dulsberg | ... | ['"obj_firingTypes":"district_heating"'] | ['"obj_hasKitchen":"n"'] | ['"obj_cellar":"n"'] | ['"obj_yearConstructedRange":"1"'] | ['"obj_baseRent":"441.1"'] | ['"obj_livingSpace":"60"'] | ['"obj_condition":"no_information"'] | ['"obj_interiorQual":"no_information"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
2 | NaN | NaN | 249,83 | 839,01 | inkl. 110,58 € | ['lat: 53.6044918821393,'] | ['lng: 9.86423115803761'] | Warthestr. 52a | NaN | 22547 Hamburg Lurup | ... | ['"obj_firingTypes":"district_heating"'] | ['"obj_hasKitchen":"n"'] | ['"obj_cellar":"n"'] | ['"obj_yearConstructedRange":"5"'] | ['"obj_baseRent":"589.18"'] | ['"obj_livingSpace":"75"'] | ['"obj_condition":"no_information"'] | ['"obj_interiorQual":"no_information"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
3 | Einbauküche | NaN | 70 | 665 (zzgl. Heizkosten) | nicht in Nebenkosten enthalten | ['lat: 53.56394970119381,'] | ['lng: 9.956881419437474'] | Oelkersallee 53 | NaN | 22769 Hamburg Altona-Nord | ... | ['"obj_firingTypes":"no_information"'] | ['"obj_hasKitchen":"y"'] | ['"obj_cellar":"n"'] | ['"obj_yearConstructedRange":"2"'] | ['"obj_baseRent":"595"'] | ['"obj_livingSpace":"46"'] | ['"obj_condition":"well_kept"'] | ['"obj_interiorQual":"no_information"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
4 | Einbauküche | NaN | 213,33 | 651,81 | inkl. 57,78 € | ['lat: 53.60180649336605,'] | ['lng: 10.081257248271337'] | Haldesdorfer Str. 119a | NaN | 22179 Hamburg Bramfeld | ... | ['"obj_firingTypes":"district_heating"'] | ['"obj_hasKitchen":"y"'] | ['"obj_cellar":"n"'] | ['"obj_yearConstructedRange":"6"'] | ['"obj_baseRent":"438.48"'] | ['"obj_livingSpace":"52"'] | ['"obj_condition":"no_information"'] | ['"obj_interiorQual":"no_information"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
5 rows × 47 columns
df.head(n=3) # Includes index values [0:2]
einbau_kue | lift | nebenkosten | gesamt_miete | heiz_kosten | lat | lng | str | nutzf | regio | ... | json_firingTypes | json_hasKitchen | json_cellar | json_yearConstructedRange | json_baseRent | json_livingSpace | json_condition | json_interiorQual | json_petsAllowed | json_lift | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Einbauküche | NaN | 190,84 | 541,06 | inkl. 78 € | ['lat: 53.52943934146104,'] | ['lng: 10.152357145948395'] | Strietkoppel 20 | NaN | 22115 Hamburg Billstedt | ... | ['"obj_firingTypes":"district_heating"'] | ['"obj_hasKitchen":"y"'] | ['"obj_cellar":"n"'] | ['"obj_yearConstructedRange":"3"'] | ['"obj_baseRent":"350.22"'] | ['"obj_livingSpace":"76"'] | ['"obj_condition":"no_information"'] | ['"obj_interiorQual":"no_information"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
1 | NaN | NaN | 117,95 | 559,05 | inkl. 52,07 € | ['lat: 53.58414266878421,'] | ['lng: 10.06842724545814'] | Naumannplatz 2 | NaN | 22049 Hamburg Dulsberg | ... | ['"obj_firingTypes":"district_heating"'] | ['"obj_hasKitchen":"n"'] | ['"obj_cellar":"n"'] | ['"obj_yearConstructedRange":"1"'] | ['"obj_baseRent":"441.1"'] | ['"obj_livingSpace":"60"'] | ['"obj_condition":"no_information"'] | ['"obj_interiorQual":"no_information"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
2 | NaN | NaN | 249,83 | 839,01 | inkl. 110,58 € | ['lat: 53.6044918821393,'] | ['lng: 9.86423115803761'] | Warthestr. 52a | NaN | 22547 Hamburg Lurup | ... | ['"obj_firingTypes":"district_heating"'] | ['"obj_hasKitchen":"n"'] | ['"obj_cellar":"n"'] | ['"obj_yearConstructedRange":"5"'] | ['"obj_baseRent":"589.18"'] | ['"obj_livingSpace":"75"'] | ['"obj_condition":"no_information"'] | ['"obj_interiorQual":"no_information"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
3 rows × 47 columns
df.tail()
The command df.tail() is the counterpart to df.head(): it returns the last 5 rows of the DataFrame by default, if no parameters are specified by the user. Using the parameter n, one can specify the number of rows that get returned. Rows returned will always end with the row at the last index value and include the preceding n-1 rows.
First, the maximum of the index of df is checked, to show that the last printed row is indeed the last value in the index of the DataFrame. Apart from that, the example mirrors the ones from the df.head() command, to display the similarities between the two.
print('The maximum value of the range index of df is %s' % df.index.max())
df.tail()
The maximum value of the range index of df is 12494
einbau_kue | lift | nebenkosten | gesamt_miete | heiz_kosten | lat | lng | str | nutzf | regio | ... | json_firingTypes | json_hasKitchen | json_cellar | json_yearConstructedRange | json_baseRent | json_livingSpace | json_condition | json_interiorQual | json_petsAllowed | json_lift | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12490 | Einbauküche | NaN | 80 | 1.058 | 47 € | ['lat: 53.586938610218276,'] | ['lng: 10.016258235901141'] | Goldbekufer 29 | 8 m² | 22303 Hamburg Winterhude | ... | ['"obj_firingTypes":"oil"'] | ['"obj_hasKitchen":"y"'] | ['"obj_cellar":"y"'] | ['"obj_yearConstructedRange":"2"'] | ['"obj_baseRent":"931"'] | ['"obj_livingSpace":"66.5"'] | ['"obj_condition":"mint_condition"'] | ['"obj_interiorQual":"sophisticated"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
12491 | Einbauküche | NaN | 61 | 674 | 26 € | ['lat: 53.55758333359278,'] | ['lng: 10.03986901076397'] | Burgstraße 34 | NaN | 20535 Hamburg Hamm-Nord | ... | ['"obj_firingTypes":"oil"'] | ['"obj_hasKitchen":"y"'] | ['"obj_cellar":"y"'] | ['"obj_yearConstructedRange":"2"'] | ['"obj_baseRent":"587"'] | ['"obj_livingSpace":"51"'] | ['"obj_condition":"refurbished"'] | ['"obj_interiorQual":"sophisticated"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
12492 | Einbauküche | NaN | 76 | 752 | 70 € | ['lat: 53.6486450531966,'] | ['lng: 10.039966464612842'] | Bei der Ziegelei 4 | 8 m² | 22339 Hamburg Hummelsbüttel | ... | ['"obj_firingTypes":"oil"'] | ['"obj_hasKitchen":"y"'] | ['"obj_cellar":"y"'] | ['"obj_yearConstructedRange":"2"'] | ['"obj_baseRent":"606"'] | ['"obj_livingSpace":"63.69"'] | ['"obj_condition":"refurbished"'] | ['"obj_interiorQual":"sophisticated"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
12493 | Einbauküche | Personenaufzug | 59,93 | 382,73 | 21,62 € | ['lat: 53.54886472789119,'] | ['lng: 10.079737639604678'] | Culinstraße 58 | NaN | 22111 Hamburg Horn | ... | ['"obj_firingTypes":"district_heating"'] | ['"obj_hasKitchen":"y"'] | ['"obj_cellar":"n"'] | ['"obj_yearConstructedRange":"2"'] | ['"obj_baseRent":"301.18"'] | ['"obj_livingSpace":"30.89"'] | ['"obj_condition":"no_information"'] | ['"obj_interiorQual":"no_information"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"y"'] |
12494 | NaN | NaN | 101 | 499,91 | in Nebenkosten enthalten | ['lat: 53.46145736469006,'] | ['lng: 9.966232033537533'] | Denickestraße 42 b | NaN | 21075 Hamburg Harburg | ... | [] | ['"obj_hasKitchen":"n"'] | ['"obj_cellar":"n"'] | ['"obj_yearConstructedRange":"2"'] | ['"obj_baseRent":"398.91"'] | ['"obj_livingSpace":"46.93"'] | ['"obj_condition":"well_kept"'] | ['"obj_interiorQual":"no_information"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
5 rows × 47 columns
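For symmetry with the df.head() examples, the parameter n works for df.tail() as well. Since the maximum index value is 12494, the following call would return the rows at index values [12492:12494] (output omitted):
df.tail(n=3) # Includes index values [12492:12494]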
df.columns
The command df.columns does one thing and does it well, one might say: it returns the column labels of the DataFrame as an Index object. Its output can be iterated over in order to select subsets of all columns, for example in a list comprehension that makes use of conditional clauses. It also helps one find problematic column names that need to be changed in order to qualify as tidy. Its output also gives an overview of all column names and is therefore a starting point for dropping certain columns and renaming the remaining ones, so they follow a precise and easy-to-remember naming pattern.
Below, the columns of the DataFrame are printed. One can see that there are two naming patterns: German column names from the scraped listing pages, and columns carrying the prefix ‘json_’. There are several other differences between these two sets of columns, as we will see later.
df.columns
Index(['einbau_kue', 'lift', 'nebenkosten', 'gesamt_miete', 'heiz_kosten',
'lat', 'lng', 'str', 'nutzf', 'regio', 'parking', 'online_since',
'baujahr', 'objekt_zustand', 'heizungsart', 'wesent_energietr',
'endenergiebedarf', 'kaltmiete', 'quadratmeter', 'anzahl_zimmer',
'balkon/terasse', 'keller', 'typ', 'etage', 'anz_schlafzimmer',
'anz_badezimmer', 'haustiere', 'nicht_mehr_verfg_seit',
'json_heatingType', 'json_balcony', 'json_electricityBasePrice',
'json_picturecount', 'json_telekomDownloadSpeed',
'json_telekomUploadSpeed', 'json_totalRent', 'json_yearConstructed',
'json_electricityKwhPrice', 'json_firingTypes', 'json_hasKitchen',
'json_cellar', 'json_yearConstructedRange', 'json_baseRent',
'json_livingSpace', 'json_condition', 'json_interiorQual',
'json_petsAllowed', 'json_lift'],
dtype='object')
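As a minimal sketch of iterating over df.columns, the list comprehension below collects all names that carry the ‘json_’ prefix; the condition is just an example, and any predicate on the names works the same way:
json_cols = [col for col in df.columns if col.startswith('json_')]  # filter by name pattern
print(json_cols[:3]) # ['json_heatingType', 'json_balcony', 'json_electricityBasePrice']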
df.index
The command df.index prints the type of the index of the DataFrame, as well as a couple of index values from the beginning and end of the 64-bit integer index range in our example. When the final DataFrame df was created, the core of the line of code was:
df = pd.concat([df1, df2, df3], ignore_index=True)
The ignore_index=True part was important to make sure that the range indexes of the concatenated DataFrames df2 and df3 would not simply get stacked on top of the index of df1. Had that happened, the resulting index would have been unusable, since there would not have been a monotonically increasing range index in the resulting DataFrame.
The example shows what the resulting index of the final DataFrame would have been if the parameter ignore_index had not been specified at all (df_index_1), and what it would have been given ignore_index=False (df_index_2). The resulting index is the same in both cases, since False is the default value. It is important to know exactly what the index of any DataFrame looks like, in order to be able to manipulate and clean it. The resulting range index of the DataFrame, given that ignore_index=True is used, shows all the qualities a simple range index should have.
df_index_1 = pd.concat([df1, df2, df3])
df_index_2 = pd.concat([df1, df2, df3], ignore_index=False)
print('The resulting index, if no value is specified:\n %s\n ' % df_index_1.index)
print('The resulting index, if False is used:\n %s\n ' % df_index_2.index)
The resulting index, if no value is specified:
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
5582, 5583, 5584, 5585, 5586, 5587, 5588, 5589, 5590, 5591],
dtype='int64', length=12495)
The resulting index, if False is used:
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
5582, 5583, 5584, 5585, 5586, 5587, 5588, 5589, 5590, 5591],
dtype='int64', length=12495)
print('The resulting index, if True is used:\n %s\n ' % df.index)
The resulting index, if True is used:
 Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,     9,
            ...
            12485, 12486, 12487, 12488, 12489, 12490, 12491, 12492, 12493, 12494],
           dtype='int64', length=12325)
df.describe()
The command df.describe() gives summary statistics for all columns that are of a numerical data type (dtype) by default. In the default case, the following statistics are included in the output for each included column: count, mean, std, min, the 25%, 50% and 75% percentiles, and max. The following notation will be used from this point onwards: \(np.nan\) stands for missing values and \(\neg np.nan\) stands for non-missing values.
The data types (dtypes) in the DataFrame are checked, before df.describe() is explored for df.
df.dtypes
einbau_kue object
lift object
nebenkosten object
gesamt_miete object
heiz_kosten object
lat object
lng object
str object
nutzf object
regio object
parking object
online_since object
baujahr object
objekt_zustand object
heizungsart object
wesent_energietr object
endenergiebedarf object
kaltmiete object
quadratmeter object
anzahl_zimmer object
balkon/terasse object
keller object
typ object
etage object
anz_schlafzimmer float64
anz_badezimmer float64
haustiere object
nicht_mehr_verfg_seit object
json_heatingType object
json_balcony object
json_electricityBasePrice object
json_picturecount object
json_telekomDownloadSpeed object
json_telekomUploadSpeed object
json_totalRent object
json_yearConstructed object
json_electricityKwhPrice object
json_firingTypes object
json_hasKitchen object
json_cellar object
json_yearConstructedRange object
json_baseRent object
json_livingSpace object
json_condition object
json_interiorQual object
json_petsAllowed object
json_lift object
dtype: object
From the output one can see that there are only 2 columns that exclusively hold numerical data and thus have a numerical data type (dtype). All other columns have mixed contents, so pandas labels them as having dtype ‘object’. In the following, all columns will be checked, and their dtypes might change in the process of cleaning them.
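As a preview of that cleaning, the sketch below shows one way such an ‘object’ column could be converted to a numerical dtype. It assumes, purely for illustration, that nebenkosten holds German-formatted decimal strings such as ‘190,84’; the actual cleaning steps follow in the later parts of the series:
# Illustrative sketch: turn German decimal commas into dots, then parse as numbers
nebenkosten_numeric = pd.to_numeric(
    df['nebenkosten'].str.replace(',', '.', regex=False),
    errors='coerce',  # anything unparsable becomes np.nan instead of raising
)
print(nebenkosten_numeric.dtype) # float64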
With the information that only 2 columns have a numerical dtype, calling df.describe() with no further parameters will print the summary statistics listed above for those two columns only. See the output below for more details.
df.describe()
anz_schlafzimmer | anz_badezimmer | |
---|---|---|
count | 6469.000000 | 7317.000000 |
mean | 1.579379 | 1.096351 |
std | 0.746926 | 0.321261 |
min | 0.000000 | 0.000000 |
25% | 1.000000 | 1.000000 |
50% | 1.000000 | 1.000000 |
75% | 2.000000 | 1.000000 |
max | 8.000000 | 11.000000 |
We will use the summary statistics of the variable anz_schlafzimmer as an example of how one can interpret their values for a given data series. The variable gives the number of bedrooms that a listing has, according to the data found in the listing on the website.
A good companion is df['anz_schlafzimmer'].value_counts(), which prints a two-column table: the first column lists all unique values in the data series, and the second column gives the count for each of them in the same row. \(np.nan\) values are excluded. This helps one understand that the mean \(\bar{x} \approx 1.58\) signals that many listings have two or fewer bedrooms. The distributions will be analyzed at a later stage; for now the focus is on getting a ‘first look’ at the DataFrame.
Below one finds the value counts for the variable anz_schlafzimmer, which describes the number of bedrooms found in each listing.
df['anz_schlafzimmer'].value_counts()
1.0 3544
2.0 2207
3.0 594
4.0 97
5.0 15
0.0 10
8.0 1
6.0 1
Name: anz_schlafzimmer, dtype: int64
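As a quick sanity check, the mean reported by df.describe() can be recomputed from these value counts:
counts = df['anz_schlafzimmer'].value_counts()
mean_bedrooms = (counts.index.to_numpy() * counts.to_numpy()).sum() / counts.sum()  # weighted mean
print(round(mean_bedrooms, 2)) # 1.58, matching the describe() output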
df.shape
The command returns a tuple object holding two numerical values. Let \((x, y)\) be the output of df.shape: \(x\) gives the number of rows of df, while \(y\) gives the number of columns.
See below for the created df.
df.shape
(12325, 47)
df.count()
df.count() returns the count of all \(\neg np.nan\) values for each column, or for a subset of columns.
In the example, the output was shortened by only including 4 randomly selected columns out of the 47 columns in df.
df.count().sample(4,random_state=seed)
nicht_mehr_verfg_seit 11629
json_cellar 12324
haustiere 3914
json_condition 12324
dtype: int64
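Combined with the number of rows from df.shape, df.count() also yields the number of missing values per column. A small sketch for the same four columns:
missing = df.shape[0] - df.count()  # rows minus non-missing values per column
missing.sample(4, random_state=seed) # e.g. haustiere: 12325 - 3914 = 8411 missing values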
df.nunique()
Returns the number of unique values for each column in the DataFrame, or for a subset of columns in df. It does not return the unique values themselves.
The example shows how it can be applied to df and what the output looks like. A subset of the columns is used again, to keep the output readable.
df.nunique().sample(4,random_state=seed)
nicht_mehr_verfg_seit 701
json_cellar 2
haustiere 3
json_condition 10
dtype: int64
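If the unique values themselves are needed, Series.unique() returns them for a single column; for example, for the haustiere column, whose 3 unique values were counted above:
df['haustiere'].unique() # returns the distinct values, plus np.nan if present (output omitted)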
df.filter()
df.filter() can be used like df.loc, if the parameter items is added: it returns the columns specified as a list of column names. However, where it shines is when there are subsets of columns that share a certain pattern in their names. In this case, one can use the parameter regex, followed by a regex pattern, along with the parameter and value axis=1.
In the first example, df.filter() is used like df.loc to select a subset of two columns. The second example shows how regex can be used to filter certain columns. As mentioned earlier, the constructed DataFrame has a subset of columns whose names all begin with the prefix ‘json_’. Using regex makes it easy to filter out these columns.
Example 1 - Using df.filter()
to select a subset of columns.
df.filter(items=['lift','str']).sample(4,random_state=seed)
lift | str | |
---|---|---|
9939 | NaN | NaN |
1217 | NaN | Alter Güterbahnhof 10a |
6023 | NaN | Jütlandring 48 |
12239 | NaN | Estebogen 22 |
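For comparison, the same selection can be written with df.loc:
df.loc[:, ['lift', 'str']].sample(4, random_state=seed) # equivalent to df.filter(items=['lift', 'str'])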
Example 2 - Using df.filter()
to select all columns that have the prefix ‘json_’ in their names.
df.filter(regex='^json', axis=1).sample(4,random_state=seed)
json_heatingType | json_balcony | json_electricityBasePrice | json_picturecount | json_telekomDownloadSpeed | json_telekomUploadSpeed | json_totalRent | json_yearConstructed | json_electricityKwhPrice | json_firingTypes | json_hasKitchen | json_cellar | json_yearConstructedRange | json_baseRent | json_livingSpace | json_condition | json_interiorQual | json_petsAllowed | json_lift | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9939 | [] | ['"obj_balcony":"y"'] | ['"obj_electricityBasePrice":"90.76"'] | ['"obj_picturecount":"12"'] | ['"obj_telekomDownloadSpeed":"16 MBit/s"'] | ['"obj_telekomUploadSpeed":"2,4 MBit/s"'] | ['"obj_totalRent":"1000"'] | ['"obj_yearConstructed":"1983"'] | ['"obj_electricityKwhPrice":"0.1985"'] | ['"obj_firingTypes":"no_information"'] | ['"obj_hasKitchen":"n"'] | ['"obj_cellar":"n"'] | ['"obj_yearConstructedRange":"4"'] | ['"obj_baseRent":"750"'] | ['"obj_livingSpace":"87"'] | ['"obj_condition":"no_information"'] | ['"obj_interiorQual":"no_information"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
1217 | [] | ['"obj_balcony":"y"'] | ['"obj_electricityBasePrice":"90.76"'] | ['"obj_picturecount":"5"'] | [] | [] | ['"obj_totalRent":"680.23"'] | ['"obj_yearConstructed":"2015"'] | ['"obj_electricityKwhPrice":"0.1985"'] | ['"obj_firingTypes":"no_information"'] | ['"obj_hasKitchen":"y"'] | ['"obj_cellar":"n"'] | ['"obj_yearConstructedRange":"8"'] | ['"obj_baseRent":"441.03"'] | ['"obj_livingSpace":"75"'] | ['"obj_condition":"no_information"'] | ['"obj_interiorQual":"no_information"'] | ['"obj_petsAllowed":"no_information"'] | ['"obj_lift":"n"'] |
6023 | ['"obj_heatingType":"central_heating"'] | ['"obj_balcony":"n"'] | ['"obj_electricityBasePrice":"90.76"'] | ['"obj_picturecount":"9"'] | [] | [] | ['"obj_totalRent":"1188"'] | ['"obj_yearConstructed":"1903"'] | ['"obj_electricityKwhPrice":"0.1985"'] | ['"obj_firingTypes":"no_information"'] | ['"obj_hasKitchen":"y"'] | ['"obj_cellar":"y"'] | ['"obj_yearConstructedRange":"1"'] | ['"obj_baseRent":"938"'] | ['"obj_livingSpace":"67"'] | ['"obj_condition":"well_kept"'] | ['"obj_interiorQual":"sophisticated"'] | ['"obj_petsAllowed":"no"'] | ['"obj_lift":"n"'] |
12239 | ['"obj_heatingType":"central_heating"'] | ['"obj_balcony":"n"'] | ['"obj_electricityBasePrice":"90.76"'] | ['"obj_picturecount":"8"'] | ['"obj_telekomDownloadSpeed":"100 MBit/s"'] | ['"obj_telekomUploadSpeed":"40 MBit/s"'] | ['"obj_totalRent":"685"'] | ['"obj_yearConstructed":"1970"'] | ['"obj_electricityKwhPrice":"0.1985"'] | ['"obj_firingTypes":"oil"'] | ['"obj_hasKitchen":"n"'] | ['"obj_cellar":"y"'] | ['"obj_yearConstructedRange":"2"'] | ['"obj_baseRent":"485"'] | ['"obj_livingSpace":"65"'] | ['"obj_condition":"well_kept"'] | ['"obj_interiorQual":"normal"'] | ['"obj_petsAllowed":"negotiable"'] | ['"obj_lift":"n"'] |
df.sample()
Allows one to randomly sample rows or columns from a DataFrame or pandas.Series. It was used several times in the examples so far, in order to give a better glimpse of the data in df. Alternatives would have been, among others, df.head() and df.tail(). The main reason df.sample() was preferred over these alternatives is that it returns a subset of rows or columns that is not constricted to the very beginning of the index, as with df.head(), or the very end of the index, as with df.tail(). The subset that df.sample() produces might not have anything over the ones produced by df.head() or df.tail(), since it is a random sample after all. It is advised to specify a seed whenever any kind of random element is part of a command. In the case of df.sample(), one can pass a random seed in several different ways. Here, only an integer value was needed (seed = 42), as defined along with the imports for this article; the parameter random_state takes the value of the random seed (random_state=seed). Specifying a seed makes the output consistent and, most importantly, reproducible when run several times by oneself or by anyone else executing the file again.
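To illustrate a few of the ways the sampling can be steered, here is a small sketch; the parameter values are arbitrary examples:
df.sample(n=4, random_state=seed)          # 4 random rows, reproducible via the integer seed
df.sample(frac=0.01, random_state=seed)    # roughly 1% of all rows instead of a fixed count
df.sample(n=3, axis=1, random_state=seed)  # 3 random columns instead of rows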
This marks the end of this article. The steps for creating a new DataFrame from several smaller DataFrames were covered, before various tools in the pandas library were showcased, describing each one and using examples to show how they can be used in projects.
Mastery in Pandas: In-Depth Data Exploration, Part 1
PyJanitor Proficiency: Efficient String Data Cleaning, Part 2
Geospatial Engineering in Pandas: Creating Valid GPS Columns, Part 3
Advanced Data Cleaning and Validation: Batch Processing with Pandas, Part 4