Loading and Querying Excel Files#

In this tutorial, we will be using small financial data stored in an Excel file containing over 700 records. The dataset is publicly available here. We will use the read_excel function from the pandas library to read the Excel file and store it in the database using the %sql --persist command of jupysql, which works across multiple databases. For additional compatibility between different databases and jupysql, please check out this page.

Note

DuckDB doesn’t support reading excel files. Their excel extension provides excel like formatting.

Note

For this tutorial, we aim to showcase the versatility of jupysql as a framework by using --persist. However, DuckDB natively supports Pandas DataFrame and you do not need to use --persist. With DuckDB, complex queries such as aggregations and joins can run more efficiently on the DataFrame compared to Pandas native functions. You can refer to this blog for a detailed comparison (Note: the comparison is based on Pandas v1.*, not the recently released Pandas v2.*, which uses PyArrow as a backend).

Installing dependencies:

%pip install jupysql duckdb duckdb-engine pandas openpyxl --quiet
Note: you may need to restart the kernel to use updated packages.

Reading dataframe using pandas.read_excel:

import pandas as pd

df = pd.read_excel("https://go.microsoft.com/fwlink/?LinkID=521962")

Initializing jupysql and connecting to duckdb database

%load_ext sql
%sql duckdb://
Connecting to 'duckdb://'

Persisting the dataframe in duckdb database. It is stored in table named df.

# If you are using DuckDB, you can omit this cell
%sql --persist df
Running query in 'duckdb://'
Success! Persisted df to the database.

Running some standard queries#

  • Selecting first 3 queries

%%sql 
SELECT *
FROM df
LIMIT 3
Running query in 'duckdb://'
index Segment Country Product Discount Band Units Sold Manufacturing Price Sale Price Gross Sales Discounts Sales COGS Profit Date Month Number Month Name Year
0 Government Canada Carretera None 1618.5 3 20 32370.0 0.0 32370.0 16185.0 16185.0 2014-01-01 00:00:00 1 January 2014
1 Government Germany Carretera None 1321.0 3 20 26420.0 0.0 26420.0 13210.0 13210.0 2014-01-01 00:00:00 1 January 2014
2 Midmarket France Carretera None 2178.0 3 15 32670.0 0.0 32670.0 21780.0 10890.0 2014-06-01 00:00:00 6 June 2014
  • Countries in the database

%%sql 
SELECT DISTINCT Country
FROM df
Running query in 'duckdb://'
Country
Canada
Germany
France
Mexico
United States of America
  • Evaluating total profit country-wise and ordering them in desceding order according to profit.

%%sql
select Country, SUM(Profit) Total_Profit
from df
group by Country
order by Total_Profit DESC
Running query in 'duckdb://'
Country Total_Profit
France 3781020.7800000007
Germany 3680388.8200000008
Canada 3529228.8850000002
United States of America 2995540.664999999
Mexico 2907523.1100000003