DuckDB - Bringing Big Data To Your Laptop
TL;DR: DuckDB is an online analytical processing (OLAP) database management system with a Python library that makes it possible to quickly perform initial Extract Transform Load (ETL) steps on larger-than-memory datasets without resorting to heavier solutions like Hadoop/Spark. It's a useful tool for extending analysis typically performed in pandas to larger collections of data, especially data in parquet format.
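As a quick taste of the API (a minimal sketch; the file name here is just a placeholder):

```python
import duckdb

# Run SQL directly against a parquet file on disk and pull the result into
# pandas; only the columns referenced in the query are actually read, so the
# file can be larger than available memory.
rides_by_payment = duckdb.query("""
    SELECT payment_type, COUNT(*) AS rides
    FROM 'yellow_tripdata_2021-01.parquet'
    GROUP BY payment_type
""").df()
```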
Exploring DuckDB - Tons of Taxis
This post documents an exploratory analysis of the DuckDB Python library for processing big data (roughly defined as data which is too large to fit into memory). The candidate dataset is the excellent record of taxi trips provided by the New York City Taxi and Limousine Commission. The data is in parquet format, an open source column-based format "designed for efficient data storage and retrieval".
In this exploration, I look at the evolution of fare and payment type for NYC yellow taxis from 2009 to the present day using DuckDB. While the results are not too surprising (fares are more expensive, more people use credit cards than cash now, and the total number of yellow taxi rides is on the decline), I came away very impressed with the performance of DuckDB queries and the usefulness of its Python package.
Import DuckDB and other python libraries
import duckdb
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import matplotlib.pylab as pl
import pandas as pd
import time
import urllib
import os.path
import numpy as np
Getting the Data
Data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page is provided on a monthly basis all the way back to 2009. Investigating the URLs from the website, we can see that they follow the format `https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_<YYYY>-<MM>.parquet`, where `<YYYY>` is the year (e.g. `2021`) and `<MM>` is the zero-padded month (e.g. `02` for February). So first let's make a pandas dataframe which stores the appropriate link, month, and year.
Create dataframe relating links to months/years
# Utilize the % operator for sprintf like formatting
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_%d-%02d.parquet"
taxi_years = list(range(2009,2024,1))
months_as_nums = list(range(1,13,1))
taxi_links = []
for year in taxi_years:
    for month in months_as_nums:
        taxi_links.append(
            {
                'year':year,
                'month':month,
                'data_url':BASE_URL % (year,month)
            }
        )
taxi_df = pd.DataFrame.from_records(taxi_links)
#Remove months that aren't available online/haven't happened yet
taxi_df = taxi_df[~((taxi_df['year']==2023) & (taxi_df['month']>7))]
taxi_df.head()
|   | year | month | data_url |
|---|---|---|---|
| 0 | 2009 | 1 | https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2009-01.parquet |
| 1 | 2009 | 2 | https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2009-02.parquet |
| 2 | 2009 | 3 | https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2009-03.parquet |
| 3 | 2009 | 4 | https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2009-04.parquet |
| 4 | 2009 | 5 | https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2009-05.parquet |
Downloading the Data
Next step is to download the data using the `urllib` package. The code below downloads each file to the current folder with the same naming convention provided by the NYC Taxi & Limousine Commission; the local paths are then added to the `taxi_df` dataframe in the `filename` column.
for url in taxi_df.data_url:
    #check to see if file is downloaded in-case rerunning script
    if not(os.path.isfile(url.split('/')[-1])):
        urllib.request.urlretrieve(url, url.split('/')[-1])
The local files can also be added to the pandas dataframe used to track the URLs.
taxi_df['filename'] = [url.split('/')[-1] for url in taxi_df.data_url]
Once the data is downloaded, we can check the total size of the dataset by accessing the parquet metadata (more info available here).
total_size_GB = duckdb.execute(f"""
SELECT SUM(total_compressed_size)/(1024*1024*1024) FROM parquet_metadata({list(taxi_df.filename)})
""").fetchall()[0][0]
total_size_uncompressed_GB = duckdb.execute(f"""
SELECT SUM(total_uncompressed_size)/(1024*1024*1024) FROM parquet_metadata({list(taxi_df.filename)})
""").fetchall()[0][0]
print(f"Total size of all files (compressed size): \t {total_size_GB:.2f} GB")
print(f"Total uncompressed size (pandas in memory): \t {total_size_uncompressed_GB:.2f} GB")
Total size of all files (compressed size): 27.99 GB
Total uncompressed size (pandas in memory): 59.40 GB
With a total compressed size of about 28 GB, this dataset is too large to hold in memory (at least on my personal computer), especially since the uncompressed size is more than twice that at 59.4 GB. It also highlights that parquet is a compressed data format, which can save considerable disk space for columns with many repeating or empty values.
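To see that effect on a small scale, here is a rough sketch (not part of the original analysis, and assuming a parquet engine like pyarrow is installed) comparing a highly repetitive column stored as parquet versus CSV:

```python
import os
import pandas as pd

# A column with only two distinct, heavily repeated values compresses very
# well in parquet compared to a plain-text CSV of the same data.
demo = pd.DataFrame({"payment_type": ["CASH", "Credit"] * 500_000})
demo.to_parquet("demo.parquet")
demo.to_csv("demo.csv", index=False)
print(f"parquet: {os.path.getsize('demo.parquet')/1024:.0f} KB, "
      f"csv: {os.path.getsize('demo.csv')/1024:.0f} KB")
```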
Investigate Columns - What’s The Data?
With the data downloaded, let's take a look at the schema and see if it is consistent across all of the parquet files. If the schema does differ, DuckDB provides the ability to handle unions with either `union_by_name` or `union_by_position` (see more here).
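For example, `union_by_name` fills in missing columns with NULLs when the files don't agree (a minimal sketch with hypothetical file names):

```python
# A minimal sketch (hypothetical file names): combine two parquet files whose
# schemas differ; columns absent from one file come back as NULL for its rows.
combined = duckdb.execute("""
    SELECT *
    FROM read_parquet(['old_schema.parquet', 'new_schema.parquet'],
                      union_by_name=true)
    LIMIT 5
""").df()
```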
col_vals = duckdb.execute(f"""
SELECT file_name, path_in_schema FROM parquet_metadata({list(taxi_df.filename)})
""").df()
We can see which columns are available in each dataset. Let's group by lowercased column names, since SQL column identifiers are case-insensitive.
grouped_col_names = col_vals.groupby(col_vals['path_in_schema'].str.lower())\
    .agg(['unique'])\
    .reset_index()
grouped_col_names.head()
|   | path_in_schema | file_name (unique) | path_in_schema (unique) |
|---|---|---|---|
| 0 | __index_level_0__ | [yellow_tripdata_2010-02.parquet, yellow_tripd... | [__index_level_0__] |
| 1 | airport_fee | [yellow_tripdata_2011-01.parquet, yellow_tripd... | [airport_fee, Airport_fee] |
| 2 | congestion_surcharge | [yellow_tripdata_2011-01.parquet, yellow_tripd... | [congestion_surcharge] |
| 3 | dolocationid | [yellow_tripdata_2011-01.parquet, yellow_tripd... | [DOLocationID] |
| 4 | dropoff_datetime | [yellow_tripdata_2010-01.parquet, yellow_tripd... | [dropoff_datetime] |
And we can find the coverage of variable names across datasets by checking which files each column appears in.
all_files = set(taxi_df.filename.values)
coverage = [
    len(
        all_files.intersection(set(grouped_col_names['file_name']['unique'].iloc[i]))
    )/len(all_files)
    for i in range(grouped_col_names.shape[0])
]
grouped_col_names['coverage'] = coverage
We can see columns that appear in the largest fraction of parquet files.
grouped_col_names.sort_values(by='coverage',ascending=False).head(10)
|   | path_in_schema | file_name (unique) | path_in_schema (unique) | coverage |
|---|---|---|---|---|
| 35 | trip_distance | [yellow_tripdata_2009-01.parquet, yellow_tripd... | [Trip_Distance, trip_distance] | 1.000000 |
| 15 | payment_type | [yellow_tripdata_2009-01.parquet, yellow_tripd... | [Payment_Type, payment_type] | 1.000000 |
| 14 | passenger_count | [yellow_tripdata_2009-01.parquet, yellow_tripd... | [Passenger_Count, passenger_count] | 1.000000 |
| 13 | mta_tax | [yellow_tripdata_2009-01.parquet, yellow_tripd... | [mta_tax] | 1.000000 |
| 10 | fare_amount | [yellow_tripdata_2010-01.parquet, yellow_tripd... | [fare_amount] | 0.931429 |
| 27 | tip_amount | [yellow_tripdata_2010-01.parquet, yellow_tripd... | [tip_amount] | 0.931429 |
| 25 | store_and_fwd_flag | [yellow_tripdata_2010-01.parquet, yellow_tripd... | [store_and_fwd_flag] | 0.931429 |
| 31 | total_amount | [yellow_tripdata_2010-01.parquet, yellow_tripd... | [total_amount] | 0.931429 |
| 29 | tolls_amount | [yellow_tripdata_2010-01.parquet, yellow_tripd... | [tolls_amount] | 0.931429 |
| 21 | ratecodeid | [yellow_tripdata_2011-01.parquet, yellow_tripd... | [RatecodeID] | 0.862857 |
And those appearing in the smallest fraction of parquet files.
grouped_col_names.sort_values(by='coverage',ascending=False).tail(10)
|   | path_in_schema | file_name (unique) | path_in_schema (unique) | coverage |
|---|---|---|---|---|
| 7 | end_lat | [yellow_tripdata_2009-01.parquet, yellow_tripd... | [End_Lat] | 0.068571 |
| 30 | tolls_amt | [yellow_tripdata_2009-01.parquet, yellow_tripd... | [Tolls_Amt] | 0.068571 |
| 8 | end_lon | [yellow_tripdata_2009-01.parquet, yellow_tripd... | [End_Lon] | 0.068571 |
| 28 | tip_amt | [yellow_tripdata_2009-01.parquet, yellow_tripd... | [Tip_Amt] | 0.068571 |
| 11 | fare_amt | [yellow_tripdata_2009-01.parquet, yellow_tripd... | [Fare_Amt] | 0.068571 |
| 16 | pickup_datetime | [yellow_tripdata_2010-01.parquet, yellow_tripd... | [pickup_datetime] | 0.068571 |
| 17 | pickup_latitude | [yellow_tripdata_2010-01.parquet, yellow_tripd... | [pickup_latitude] | 0.068571 |
| 24 | store_and_forward | [yellow_tripdata_2009-01.parquet, yellow_tripd... | [store_and_forward] | 0.068571 |
| 23 | start_lon | [yellow_tripdata_2009-01.parquet, yellow_tripd... | [Start_Lon] | 0.068571 |
| 0 | __index_level_0__ | [yellow_tripdata_2010-02.parquet, yellow_tripd... | [__index_level_0__] | 0.011429 |
For reference, the most recent documentation of the schema is provided by the NYC Taxi and Limousine Commission here, but it is obvious from the above investigation that the schema has changed over time.
DuckDB Queries
Up until now I've been dealing with the metadata of the parquet files, which can be read without loading the entire file into memory. So the following demonstrations are the first that would be difficult to perform on a dataset this large on a single machine without a streaming approach.
Number of entries
First we can do a quick count of how many rows we have across all of our parquet files.
total_rows = duckdb.execute(f"""
SELECT COUNT(*) FROM read_parquet({list(taxi_df.filename)})
""").fetchall()[0][0]
print(f"Total number of rows {total_rows:.2e}")
Total number of rows 1.72e+09
So we have over 1.7 billion records of taxi trips over the past 14 years. That's a lot of taxi rides!
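For a sense of what that one-liner replaces, a pandas-only approach would have to stream the files one at a time, since only a single file fits in memory at once. A rough (and much slower) sketch:

```python
# A rough pandas equivalent: each file fits in memory on its own, even though
# the full collection does not, so loop over them and accumulate the count.
total_rows_pandas = 0
for fname in taxi_df.filename:
    total_rows_pandas += len(pd.read_parquet(fname))
print(f"Total number of rows {total_rows_pandas:.2e}")
```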
Aggregation Methods
Next let's get percentiles on the `total_amount` field. DuckDB has a collection of aggregation functions which perform calculations in parallel as the data is scanned, without requiring the full dataset to be loaded into memory. We'll start off with the `approx_quantile` function, which estimates quantiles using the T-Digest algorithm.
Since `total_amount` is not present in all parquet files, we need to pass `union_by_name=true` to the `read_parquet` function.
# disable so it doesn't show up in markdown after nbconvert
duckdb.execute("PRAGMA disable_progress_bar;");
percentiles = np.linspace(0.01,0.99,num=99)
total_amount_pctiles = duckdb.execute(f"""
SELECT approx_quantile(total_amount,[{",".join([str(x) for x in percentiles])}])
FROM read_parquet({list(taxi_df.filename)},union_by_name=true)
WHERE total_amount > 0
""").fetchall()[0][0]
Plot distribution of ride costs over entire dataset
plt.plot(total_amount_pctiles,percentiles,linewidth=3)
plt.xlabel('Ride Cost (Dollars)',size=16)
plt.ylabel('Percentile',size=16)
plt.grid(alpha=0.2)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.title('Ride Cost For Yellow Taxi over 2009-2023', size=16)
plt.show()
View evolution of ride cost on a per year basis
With that out of the way, we can use the year information in the filename to plot the evolution of ride cost over the years.
pcts_over_years = dict()
for year in range(2010,2024,1):
    year_fnames = taxi_df[taxi_df['year']==year]['filename']
    year_pcts = duckdb.execute(f"""
        SELECT approx_quantile(total_amount,[{",".join([str(x) for x in percentiles])}])
        FROM read_parquet({list(year_fnames)},union_by_name=true)
        WHERE total_amount > 0""").fetchall()[0][0]
    year_dict = { str(year):year_pcts }
    pcts_over_years = dict(pcts_over_years, **year_dict)
Plot total fare distribution per year
colors = pl.cm.magma(np.linspace(.1,.7,len(pcts_over_years.keys())))
for idx, year in enumerate(pcts_over_years):
plt.plot(pcts_over_years[year],percentiles,label=year,alpha=0.5,color=colors[idx],linewidth=3)
plt.title('Yellow Taxi Ride Cost Distribution 2010-2023',size=16)
plt.xlabel('Ride Cost (Dollars)',size=16)
plt.ylabel('Percentile',size=16)
plt.grid(alpha=0.2)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.legend()
plt.show()
The above plot shows that the nominal cost of yellow taxi rides in NYC has been steadily rising, with a particularly large jump in 2023. This fits with the trend of higher inflation over the past couple of years. If we wanted to investigate the breakdown of the total fare over time (taxes, tips, fees, etc.), those columns are in the dataset as well.
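For example, one way to start on that breakdown could look like the sketch below (a single year chosen arbitrarily; column coverage varies across years, so `union_by_name` is used again):

```python
# A sketch: average fare components for one year (2022 chosen arbitrarily).
year_fnames = taxi_df[taxi_df['year'] == 2022]['filename']
fare_breakdown = duckdb.execute(f"""
    SELECT AVG(fare_amount)  AS avg_fare,
           AVG(tip_amount)   AS avg_tip,
           AVG(tolls_amount) AS avg_tolls,
           AVG(mta_tax)      AS avg_mta_tax
    FROM read_parquet({list(year_fnames)}, union_by_name=true)
    WHERE total_amount > 0
""").df()
```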
Payment Types
Now let's look at the counts of the different payment types:
payment_type_counts = duckdb.execute(f"""
SELECT payment_type, COUNT(*) as occurrence_count
FROM read_parquet({list(taxi_df.filename)},union_by_name=true)
GROUP BY payment_type; """).df()
payment_type_counts.sort_values(by=['occurrence_count'],ascending=False)
|   | Payment_Type | occurrence_count |
|---|---|---|
| 5 | 1 | 821291745 |
| 18 | 2 | 548052362 |
| 20 | CASH | 69117503 |
| 17 | Cash | 56282593 |
| 6 | CSH | 50210641 |
| 3 | Credit | 42561382 |
| 12 | CRD | 30829647 |
| 4 | CAS | 30792977 |
| 16 | Cre | 27416855 |
| 2 | Cas | 26058725 |
| 19 | 0 | 4773430 |
| 13 | 3 | 4440736 |
| 0 | CRE | 3370093 |
| 10 | CREDIT | 2330599 |
| 1 | 4 | 1949962 |
| 14 | 5 | 753370 |
| 9 | No Charge | 509194 |
| 22 | No | 200505 |
| 11 | Dispute | 94784 |
| 8 | Dis | 43614 |
| 21 | NA | 40013 |
| 15 | NOC | 31817 |
| 7 | DIS | 6275 |
Like any real dataset, we have some messy data as a result of human error and changing schemas. Looking at the official documentation, there are currently only six payment types, each with a numerical code:
- 1 = Credit
- 2 = Cash
- 3 = No Charge
- 4 = Dispute
- 5 = Unknown
- 6 = Voided Trip
I’m interested in the trends of cash versus credit over time. Let’s hand clean this mess by matching the data to its intended value.
total_cash = duckdb.execute(f"""
SELECT COUNT(*) FROM read_parquet({list(taxi_df.filename)},union_by_name=true)
WHERE payment_type IN (2, 'CASH','Cash','CSH', 'CAS', 'Cas')
""").fetchall()[0][0]
total_credit = duckdb.execute(f"""
SELECT COUNT(*) FROM read_parquet({list(taxi_df.filename)},union_by_name=true)
WHERE payment_type IN (1, 'Credit', 'CRD', 'Cre','CRE', 'CREDIT')
""").fetchall()[0][0]
print(f"""
Total Cash Transactions: \t\t{total_cash:.2e}
Total Credit Transactions: \t\t{total_credit:.2e}
Fraction of Total Transactions: \t{(total_cash+total_credit)/total_rows:.6f}
""")
Total Cash Transactions: 7.81e+08
Total Credit Transactions: 9.28e+08
Fraction of Total Transactions: 0.992538
So there are slightly more credit transactions than cash transactions over the entire dataset. And predictably, "Cash or Credit?" covers nearly everything: the two together account for 99.25% of all transactions. But how does the proportion of cash versus credit evolve over time?
years = list(range(2010,2024,1))
cash_years = []
credit_years = []
for year in years:
    year_fnames = taxi_df[taxi_df['year']==year]['filename']
    cash_years.append(
        duckdb.execute(f"""
            SELECT COUNT(*) FROM read_parquet({list(year_fnames)},union_by_name=true)
            WHERE payment_type IN (2, 'CASH','Cash','CSH', 'CAS', 'Cas')
        """).fetchall()[0][0])
    credit_years.append(
        duckdb.execute(f"""
            SELECT COUNT(*) FROM read_parquet({list(year_fnames)},union_by_name=true)
            WHERE payment_type IN (1, 'Credit', 'CRD', 'Cre','CRE', 'CREDIT')
        """).fetchall()[0][0])
Plot "Cash or Credit" annually over dataset"
# Total Transaction Volume
plt.plot(years,cash_years,label='Cash Transactions',linewidth=3)
plt.plot(years,credit_years,label='Credit Transactions',linewidth=3)
plt.plot(years,np.add(cash_years,credit_years),label='Total Transactions',linewidth=3)
plt.xlabel('Year', size=16)
plt.ylabel('Annual Transactions', size=16)
plt.title('NYC Taxi Rides - Cash vs. Credit Total Volume')
plt.legend()
plt.grid(alpha=0.2)
plt.xticks(size=14)
plt.yticks(size=14)
plt.show()
Plot "Cash or Credit" as percentage of all transactions
# Transaction Percentage
plt.plot(years,np.divide(cash_years,np.add(cash_years,credit_years)),label='Cash',linewidth=3)
plt.plot(years,np.divide(credit_years,np.add(cash_years,credit_years)),label='Credit',linewidth=3)
plt.xlabel('Year', size=16)
plt.ylabel('Transaction Percentage', size=16)
plt.title('NYC Taxi Rides - Cash vs. Credit Percentage')
plt.legend()
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.grid(alpha=0.2)
plt.xticks(size=14)
plt.yticks(size=14)
plt.show()
I’m Different - User Defined Functions
If the SQL queries and DuckDB aggregate functions are not enough for a given use case, DuckDB provides the ability to define User Defined Functions (UDFs). With UDFs, we can write pure Python functions that are evaluated as the data is continuously read. So if the initial aggregation is still too large for memory, we can apply our own functions as the data is read in. It is especially cool that we can use third party libraries within the defined function! I'm assuming this works because DuckDB is bound to the particular Python process being run.
For this example, let's do some arbitrary math: take the absolute difference between two values and then calculate the square root with numpy. If we provide type annotations, DuckDB is smart enough to handle the rest. We just need to select rows where both values are not null so the query doesn't error out.
def random_func(x: float, y: float) -> float:
    return np.sqrt(abs(x-y))

duckdb.create_function('random_func',random_func);
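As a side note, if you'd rather not rely on the Python type annotations, my understanding is that the parameter and return types can also be passed explicitly via `duckdb.typing` (a sketch):

```python
# A sketch: register the same function under a new name with explicit DuckDB
# types instead of relying on the Python type annotations.
from duckdb.typing import DOUBLE

duckdb.create_function('random_func_typed', random_func, [DOUBLE, DOUBLE], DOUBLE)
```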
Now let's use the function in a query by applying it to the `fare_amount` and `tip_amount` columns.
random_ans = duckdb.execute(f"""
SELECT AVG(random_func(fare_amount,tip_amount)) FROM read_parquet({list(taxi_df.filename)},union_by_name=true)
WHERE tip_amount IS NOT NULL
AND fare_amount IS NOT NULL
""").fetchall()[0][0]
print(f"Average value of our random function {random_ans:.2f}")
Average value of our random function 3.07
Introducing this Python function caused the query to run much slower than the previous operations. Unlike other queries, which quickly brought my CPU usage up to 100%, this one kept my CPU at about 20% for the entire operation. So it is worth avoiding Python UDFs when the data is large or performance is important.
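In this particular case the UDF isn't strictly necessary, since the same expression can be written directly in SQL with DuckDB's built-in `sqrt` and `abs` functions, keeping the work inside the parallel engine — a sketch:

```python
# Same computation expressed in plain SQL, avoiding the Python round trip.
sql_ans = duckdb.execute(f"""
    SELECT AVG(SQRT(ABS(fare_amount - tip_amount)))
    FROM read_parquet({list(taxi_df.filename)}, union_by_name=true)
    WHERE tip_amount IS NOT NULL
      AND fare_amount IS NOT NULL
""").fetchall()[0][0]
print(f"Average value via SQL {sql_ans:.2f}")
```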
Final Thoughts
After playing around with DuckDB for a bit, I'm pretty impressed. It has an obvious use case for analytical workloads where the data is larger than your computer's memory, where performance is important, and where its simplicity outweighs potential gains from more heavyweight big data analysis services. Since DuckDB processes data in a column-based format, I also think it is a natural candidate for processing parquet files.
DuckDB Performance
Here are timed benchmarks for all queries performed above. As a reminder, the dataset is 28 GB compressed, 59 GB uncompressed, and has 1.72 billion rows.
These numbers will depend on hardware, so for reference I have an AMD 3600 CPU and the data is stored on a Samsung 970 EVO SSD with quoted read/write speeds of 3,500 MB/s and 3,300 MB/s respectively. Based on these speeds, I would expect loading all columns of all parquet files into memory to take at least 28 GB / 3.5 GB/s \(\approx\) 8 seconds, if I had that much RAM (which I don't). Most of the default DuckDB queries which require scanning all of the files run on the same time scale of around 10 seconds or less. I'm assuming DuckDB can achieve these times because it selectively loads only the parts of the dataset it needs to execute the query.
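One way to sanity-check that assumption (a sketch; the exact plan text varies by DuckDB version) is to look at the query plan, where the parquet scan node lists only the columns the query actually needs:

```python
# EXPLAIN returns the physical plan as text; the scan node should show that
# only total_amount is projected from the parquet files.
plan = duckdb.execute(f"""
    EXPLAIN SELECT approx_quantile(total_amount, 0.5)
    FROM read_parquet({list(taxi_df.filename)}, union_by_name=true)
    WHERE total_amount > 0
""").fetchall()
print(plan[0][1])
```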
Query: Getting total compressed size from parquet metadata
%%timeit
duckdb.execute(f"""
SELECT SUM(total_compressed_size)/(1024*1024*1024) FROM parquet_metadata({list(taxi_df.filename)})
""").fetchall()[0][0]
79.5 ms ± 2.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Query: Getting total uncompressed size from parquet metadata
%%timeit
duckdb.execute(f"""
SELECT SUM(total_uncompressed_size)/(1024*1024*1024) FROM parquet_metadata({list(taxi_df.filename)})
""").fetchall()[0][0]
82.8 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Query: Getting filename and column names from metadata
%%timeit
duckdb.execute(f"""
SELECT file_name, path_in_schema FROM parquet_metadata({list(taxi_df.filename)})
""").df()
78 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Query: Get total number of transactions
%%timeit
duckdb.execute(f"""
SELECT COUNT(*) FROM read_parquet({list(taxi_df.filename)})
""").fetchall()[0][0]
78.1 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Query: Get Approx. Percentiles of Total Amount Column
%%timeit
duckdb.execute(f"""
SELECT approx_quantile(total_amount,[{",".join([str(x) for x in percentiles])}])
FROM read_parquet({list(taxi_df.filename)},union_by_name=true)
WHERE total_amount > 0
""").fetchall()[0][0]
11.5 s ± 19.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Query: Get Approx Percentiles of Total Amount Column Per Year
%%timeit
pcts_over_years = dict()
for year in range(2010,2024,1):
    year_fnames = taxi_df[taxi_df['year']==year]['filename']
    year_pcts = duckdb.execute(f"""
        SELECT approx_quantile(total_amount,[{",".join([str(x) for x in percentiles])}])
        FROM read_parquet({list(year_fnames)},union_by_name=true)
        WHERE total_amount > 0""").fetchall()[0][0]
    year_dict = { str(year):year_pcts }
    pcts_over_years = dict(pcts_over_years, **year_dict)
12.8 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Query: Get Number of Transactions Per Payment Type
%%timeit
duckdb.execute(f"""
SELECT payment_type, COUNT(*) as occurrence_count
FROM read_parquet({list(taxi_df.filename)},union_by_name=true)
GROUP BY payment_type; """).df()
5.73 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Query: Get Number of Cash Transactions Total
%%timeit
total_cash = duckdb.execute(f"""
SELECT COUNT(*) FROM read_parquet({list(taxi_df.filename)},union_by_name=true)
WHERE payment_type IN (2, 'CASH','Cash','CSH', 'CAS', 'Cas')
""").fetchall()[0][0]
6.4 s ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Query: Get Number of Credit Transactions Total
%%timeit
duckdb.execute(f"""
SELECT COUNT(*) FROM read_parquet({list(taxi_df.filename)},union_by_name=true)
WHERE payment_type IN (1, 'Credit', 'CRD', 'Cre','CRE', 'CREDIT')
""").fetchall()[0][0]
6.52 s ± 27.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Query: Get Number of Cash/Credit Transactions Per Year
%%timeit
years = list(range(2010,2024,1))
cash_years = []
credit_years = []
for year in years:
    year_fnames = taxi_df[taxi_df['year']==year]['filename']
    cash_years.append(
        duckdb.execute(f"""
            SELECT COUNT(*) FROM read_parquet({list(year_fnames)},union_by_name=true)
            WHERE payment_type IN (2, 'CASH','Cash','CSH', 'CAS', 'Cas')
        """).fetchall()[0][0])
    credit_years.append(
        duckdb.execute(f"""
            SELECT COUNT(*) FROM read_parquet({list(year_fnames)},union_by_name=true)
            WHERE payment_type IN (1, 'Credit', 'CRD', 'Cre','CRE', 'CREDIT')
        """).fetchall()[0][0])
13.2 s ± 49.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Query: User Defined Function
#Got impatient since this is a really slow query
from timeit import default_timer as timer
start = timer()
duckdb.execute(f"""
SELECT AVG(random_func(fare_amount,tip_amount)) FROM read_parquet({list(taxi_df.filename)},union_by_name=true)
WHERE tip_amount IS NOT NULL
AND fare_amount IS NOT NULL
""").fetchall()[0][0]
end = timer()
print(f"User Defined Function Took {end-start:.2e} seconds")
User Defined Function Took 2.09e+03 seconds
Which is about 35 minutes.