How-To

How-To#

Query CSV files with SQL#

You can use JupySQL and DuckDB to query CSV files with SQL in a Jupyter notebook.

Installation#

%pip install jupysql duckdb duckdb-engine --quiet

Note: you may need to restart the kernel to use updated packages.

Setup#

Load JupySQL:

%load_ext sql

Create an in-memory DuckDB database:

%sql duckdb://

Connecting to 'duckdb://'

Download some sample data:

from urllib.request import urlretrieve

_ = urlretrieve(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
    "penguins.csv",
)

Query#

%%sql
SELECT *
FROM penguins.csv
LIMIT 3

Running query in 'duckdb://'

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
Adelie	Torgersen	39.1	18.7	181	3750	MALE
Adelie	Torgersen	39.5	17.4	186	3800	FEMALE
Adelie	Torgersen	40.3	18.0	195	3250	FEMALE

%%sql
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC

Running query in 'duckdb://'

species	count
Adelie	152
Gentoo	124
Chinstrap	68

Convert to `polars.DataFrame`#

%%sql results <<
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC

Running query in 'duckdb://'

import polars as pl

pl.DataFrame((tuple(row) for row in results), schema=results.keys)

shape: (3, 2)

species	count
str	i64
"Adelie"	152
"Gentoo"	124
"Chinstrap"	68

Register SQLite UDF#

To register a user-defined function (UDF) when using SQLite, you can use SQLAlchemy’s @event.listens_for and SQLite’s create_function:

Install JupySQL#

%pip install jupysql --quiet

Note: you may need to restart the kernel to use updated packages.

Create engine and register function#

from sqlalchemy import create_engine
from sqlalchemy import event

def mysum(x, y):
    return x + y

engine = create_engine("sqlite://")

@event.listens_for(engine, "connect")
def connect(conn, rec):
    conn.create_function(name="MYSUM", narg=2, func=mysum)

Create connection with existing engine#

New in version 0.5.1: Pass existing engines to %sql

%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql

%sql engine

Query#

%%sql
SELECT MYSUM(1, 2)

Running query in 'sqlite://'

MYSUM(1, 2)
3

Connect to a SQLite database with spaces#

Currently, due to a limitation in the argument parser, it’s not possible to directly connect to SQLite databases whose path contains spaces; however, you can do it by creating the engine first.

Setup#

%pip install jupysql --quiet

Note: you may need to restart the kernel to use updated packages.

%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql

Connect to db#

from sqlalchemy import create_engine

engine = create_engine("sqlite:///my database.db")

Add some sample data:

import pandas as pd

_ = pd.DataFrame({"x": range(5)}).to_sql("numbers", engine)

%sql engine

%%sql
SELECT * FROM numbers

Running query in 'sqlite:///my database.db'

index	x
0	0
1	1
2	2
3	3
4	4

Switch connections#

New in version 0.5.2: -A/--alias

# create two databases with sample data
from sqlalchemy import create_engine
import pandas as pd

engine_one = create_engine("sqlite:///one.db")
pd.DataFrame({"x": range(5)}).to_sql("one", engine_one)

engine_two = create_engine("sqlite:///two.db")
_ = pd.DataFrame({"x": range(5)}).to_sql("two", engine_two)

%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql

Assign alias to both connections so we can switch them by name:

%sql sqlite:///one.db --alias one
%sql sqlite:///two.db --alias two

Connecting and switching to connection 'one'

Connecting and switching to connection 'two'

%sql

Running query in 'two'

Pass the alias to make it the current connection:

%sql one

Switching to connection 'one'

Tip

We highly recommend you to create a separate cell (%sql some_alias) when switching connections instead of switching and querying in the the same cell.

You can pass an alias and query in the same cell:

%%sql one
SELECT * FROM one

index	x
0	0
1	1
2	2
3	3
4	4

However, this isn’t supported with the line magic (e.g., %sql one SELECT * FROM one).

You can also pass an alias, and assign the output to a variable, but this is discouraged:

%%sql two
result <<
SELECT * FROM two

Switching to connection 'two'

result

index	x
0	0
1	1
2	2
3	3
4	4

Once you pass an alias, it becomes the current active connection:

%sql

Running query in 'two'

Hence, we can skip it in upcoming queries:

%%sql
SELECT * FROM two

Running query in 'two'

index	x
0	0
1	1
2	2
3	3
4	4

Switch connection:

%%sql one
SELECT * FROM one

Switching to connection 'one'

index	x
0	0
1	1
2	2
3	3
4	4

%sql

Running query in 'one'

Close by passing the alias:

%sql --close one

%sql

%sql --close two

%sql -l

Active connections:

current	url	alias
	duckdb://	duckdb://
	sqlite://	sqlite://
	sqlite:///my database.db	sqlite:///my database.db

Connect to existing `engine`#

Pass the name of the engine:

some_engine = create_engine("sqlite:///some.db")

%sql some_engine

Use `%sql`/`%%sql` in Databricks#

Databricks uses the same name (%sql/%%sql) for its SQL magics; however, JupySQL exposes a %jupysql/%%jupysql alias so you can use both:

%jupysql duckdb://

Switching to connection 'duckdb://'

%jupysql SELECT * FROM "penguins.csv" LIMIT 3

Running query in 'duckdb://'

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
Adelie	Torgersen	39.1	18.7	181	3750	MALE
Adelie	Torgersen	39.5	17.4	186	3800	FEMALE
Adelie	Torgersen	40.3	18.0	195	3250	FEMALE

%%jupysql
SELECT *
FROM "penguins.csv"
LIMIT 3

Running query in 'duckdb://'

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
Adelie	Torgersen	39.1	18.7	181	3750	MALE
Adelie	Torgersen	39.5	17.4	186	3800	FEMALE
Adelie	Torgersen	40.3	18.0	195	3250	FEMALE

Ignore deprecation warnings#

We display warnings to let you know when the API will change so you have enough time to update your code, if you want to suppress this warnings, add this at the top of your notebook:

import warnings

warnings.filterwarnings("ignore", category=FutureWarning)

Hide connection string#

If you want to hide the connection string, pass an alias

%sql --close duckdb://

%sql duckdb:// --alias myconnection

Connecting and switching to connection 'myconnection'

The alias will be displayed instead of the connection string:

%sql SELECT * FROM 'penguins.csv' LIMIT 3

Running query in 'myconnection'

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
Adelie	Torgersen	39.1	18.7	181	3750	MALE
Adelie	Torgersen	39.5	17.4	186	3800	FEMALE
Adelie	Torgersen	40.3	18.0	195	3250	FEMALE

How-To

Contents

How-To#

Query CSV files with SQL#

Installation#

Setup#

Query#

Convert to polars.DataFrame#

Register SQLite UDF#

Install JupySQL#

Create engine and register function#

Create connection with existing engine#

Query#

Connect to a SQLite database with spaces#

Setup#

Connect to db#

Switch connections#

Connect to existing engine#

Use %sql/%%sql in Databricks#

Ignore deprecation warnings#

Hide connection string#

Convert to `polars.DataFrame`#

Connect to existing `engine`#

Use `%sql`/`%%sql` in Databricks#