Data Structures
Pandas
The pandas documentation lists ways to enhance performance in terms of both query speed and storage.
Storage & Memory
We can reduce the memory required to store a dataframe by downcasting each column to its smallest appropriate datatype. Queries can also become faster if categorical dtypes are assigned properly.
import pandas as pd

def reduce_df(df, cat_col=[], cat_percent=0.5, verbose=True):
    """Maximum data memory optimization.

    Args:
        cat_col (list): list of column names to convert to category dtype
        cat_percent (float): only convert if the percentage of duplicates exceeds this value
        verbose (bool): prints the reduction percentage
    """
    mem_start = df.memory_usage().sum()
    size = len(df)  # number of rows, used to compute the duplicate percentage
    for col in df.columns:
        dtype_ = df[col].dtype
        if dtype_ == float:
            df[col] = pd.to_numeric(df[col], downcast='float')
        elif dtype_ == int:
            min_ = df[col].min()
            if min_ > 0:
                df[col] = pd.to_numeric(df[col], downcast='unsigned')
            else:
                df[col] = pd.to_numeric(df[col], downcast='integer')
        elif dtype_ == object:
            if col in cat_col:
                dup_percent = 1 - (len(df[col].unique()) / size)
                if dup_percent >= cat_percent:
                    df[col] = df[col].astype('category')
    if verbose:
        mem_end = df.memory_usage().sum()
        print("{:.2f}% of original size is reduced".format((mem_start - mem_end) / mem_start * 100))
    return df
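A quick usage sketch; the filename and the store column here are hypothetical:

df = pd.read_csv("transactions.csv")  # hypothetical dataset
df = reduce_df(df, cat_col=["store"], cat_percent=0.5)
# verbose=True prints the percentage of memory saved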
Looping
As a rule of thumb, never use crude for loops to iterate over and change values in a dataframe. The apply method is already internally optimized, for instance by using iterators in Cython. We can also use vectorization on pandas Series or NumPy arrays, as suggested by this author.
import numpy as np
import pandas as pd
# Define a basic Haversine distance formula
def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    total_miles = MILES * c
    return total_miles
# better than a crude loop: apply with a lambda, row by row
df['distance'] = df.apply(lambda row: haversine(40.671, -73.985, row['latitude'], row['longitude']), axis=1)
# vectorized on pandas Series (faster)
df['distance'] = haversine(40.671, -73.985, df['latitude'], df['longitude'])
# vectorized on NumPy arrays (fastest)
df['distance'] = haversine(40.671, -73.985, df['latitude'].values, df['longitude'].values)
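To verify the speedups on your own data, each variant can be timed with %timeit in IPython/Jupyter; a minimal sketch (actual numbers depend on machine and dataframe size):

%timeit df.apply(lambda row: haversine(40.671, -73.985, row['latitude'], row['longitude']), axis=1)
%timeit haversine(40.671, -73.985, df['latitude'], df['longitude'])
%timeit haversine(40.671, -73.985, df['latitude'].values, df['longitude'].values)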
Querying
In pandas, there are various ways of querying, and the speed varies with each type. In the example below, the fastest is to convert the query column into a category datatype: since the column contains many duplicates, its conversion into integer codes under the hood makes queries more efficient. However, this might not always be the case, and using isin or in1d might suffice.
More information here.
# Unoptimized, 0.01288s
result = df[df["antecedents"] == item]
# Optimized Queries
# -------------------
# with pandas .query, 0.00722s, x1.8
result = df.query("antecedents == @item")
# with isin, 0.00440s, x2.9
result = df[df["antecedents"].isin([item])]
# with indexing, 0.004342s, x3.0
df = df.set_index("antecedents")
result = df[df.index.isin([item])]
# with numpy, 0.004175s, x3.1
result = df[np.in1d(df["antecedents"].values, [item])]
# using category datatype, 0.000608s, x21.2
df["antecedents"] = df["antecedents"].astype('category')
result = df[df["antecedents"] == item]
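Another option worth knowing, not timed above: set the query column as a sorted index and look rows up by label with .loc. A minimal sketch, assuming the same df and item as before:

# with a sorted (monotonic) index, label lookup can use binary search
df = df.set_index("antecedents").sort_index()
result = df.loc[item]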
Graph
A graph data structure enables much faster queries when the data forms a network of interconnected entities, known as nodes. The information stored on each connection between two nodes is called a link.
A simple graph structure in Python can be created using the networkx library. However, this does not scale when the data becomes huge; for that, a graph database like neo4j is more appropriate.
import networkx as nx
import pandas as pd
df = pd.read_csv("association_rules.csv")
# convert dataframe to networkx graph
graph = nx.convert_matrix.from_pandas_edgelist(
    df, "antecedents", "consequents",
    [
        "antecedent support",
        "consequent support",
        "support",
        "confidence",
        "lift",
        "leverage",
        "conviction",
    ],
    create_using=nx.DiGraph,
)
# query the direct successors of a node
result = list(graph[item])
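Where the graph structure pays off is in traversal beyond direct neighbours; a short sketch using networkx built-ins, assuming item exists as a node in the graph:

# all nodes reachable from `item`, not just direct successors
reachable = nx.descendants(graph, item)

# inspect the attributes stored on each outgoing link
for _, consequent, attrs in graph.edges(item, data=True):
    print(consequent, attrs["confidence"], attrs["lift"])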