ggplot

ggplot#

New in version 0.7:

pip install jupysql --upgrade

Note

ggplot API requires matplotlib: pip install matplotlib

The ggplot API is structured around the principles of the grammar of graphics, and allows you to build any graph using the same components: a data set, a coordinate system, and geoms (geometric objects).

To make it suitble for JupySQL, specifically for the purpose of running SQL and plotting larger-than-memory datasets on any laptop, we made a small modification from the original ggplot2 API. Rather than providing a dataset, we now provide a SQL table name.

Other than that, at this point we support:

Aes:

x - a SQL column mapping
color and fill to apply edgecolor and fill colors to plot shapes

Geoms:

geom_boxplot
geom_histogram

Facet:

facet_wrap to display multiple plots in 1 layout

Please note that each geom has its own unique attributes, e.g: number of bins in geom_histogram. We’ll cover all the possible parameters in this tutorial.

Building a graph#

To build a graph, we first should initialize a ggplot instance with a reference to our SQL table using the table parameter, and a mapping object. Here’s is the complete template to build any graph.

(
    ggplot(table='sql_table_name', mapping=aes(x='table_column_name'))
    +
    geom_func() # geom_histogram or geom_boxplot (required)
    +
    facet_func() # facet_wrap (optional)
)

Note

Please note this is the 1st release of ggplot API. We highly encourage you to provide us with your feedback through our Slack channel to assist us in improving the API and addressing any issues as soon as possible.

Examples#

First, establish the connection, import necessary functions and prepare the data.

Setup#

%load_ext sql
%sql duckdb://

from sql.ggplot import ggplot, aes, geom_boxplot, geom_histogram, facet_wrap

from pathlib import Path
from urllib.request import urlretrieve

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"

if not Path("yellow_tripdata_2021-01.parquet").is_file():
    urlretrieve(url, "yellow_tripdata_2021-01.parquet")

Boxplot#

(ggplot("yellow_tripdata_2021-01.parquet", aes(x="trip_distance")) + geom_boxplot())

<sql.ggplot.ggplot.ggplot at 0x7f731a317a90>

../_images/6e85218edf21278de74380084ce49a39a6f6da1c9bc5e5c94c5b0f9a34b8d279.png

Histogram#

To make it more interesting, let’s create a query that filters by the 90th percentile. Note that we’re using the --save, and --no-execute functions. This tells JupySQL to store the query, but skips execution. We’ll reference it in our next plotting calls using the with_ parameter.

%%sql --save short_trips --no-execute
select * from 'yellow_tripdata_2021-01.parquet'
WHERE trip_distance < 6.3

Running query in 'duckdb://'

Skipping execution...

(
    ggplot(table="short_trips", with_="short_trips", mapping=aes(x="trip_distance"))
    + geom_histogram(bins=10)
)

<sql.ggplot.ggplot.ggplot at 0x7f7308471b40>

../_images/cf31e66505803681ba6bfafd97f018d215c7bf347d870ed022e2a6ec41635cce.png

Custom Style#

By modifying the fill and color attributes, we can apply our custom style

(
    ggplot(
        table="short_trips",
        with_="short_trips",
        mapping=aes(x="trip_distance", fill="#69f0ae", color="#fff"),
    )
    + geom_histogram(bins=10)
)

<sql.ggplot.ggplot.ggplot at 0x7f72e0b79f30>

../_images/30ddf289b3bb068c48baf2e26e995aa6ad79c590d8918cdca9b4cca3fb16b6af.png

When using multiple columns we can apply color on each column

(
    ggplot(
        table="short_trips",
        with_="short_trips",
        mapping=aes(
            x=["PULocationID", "DOLocationID"],
            fill=["#d500f9", "#fb8c00"],
            color="white",
        ),
    )
    + geom_histogram(bins=10)
)

<sql.ggplot.ggplot.ggplot at 0x7f72d6fd6860>

../_images/74e96e156582000dde9e562fc3d2ed3dbfdf37272ccdc12665e3cd352dc30961.png

Categorical histogram#

To make it easier to demonstrate, let’s use ggplot2 diamonds dataset.

from pathlib import Path
from urllib.request import urlretrieve

if not Path("diamonds.csv").is_file():
    urlretrieve(
        "https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/diamonds.csv",  # noqa
        "diamonds.csv",
    )

%%sql
CREATE TABLE diamonds AS SELECT * FROM diamonds.csv

Running query in 'duckdb://'

Count

Now, let’s create a histogram of the different cuts of the diamonds by setting x='cut'. Please note, since the values of cut are strings, we don’t need the bins attribute here.

(ggplot("diamonds", aes(x="cut")) + geom_histogram())

<sql.ggplot.ggplot.ggplot at 0x7f72d6666050>

../_images/531b21dceaff9be6089164ae9e4471fa92f60ede250ceff2cf3329b17562b10e.png

We can show a histogram of multiple columns by setting x=['cut', 'color']

(ggplot("diamonds", aes(x=["cut", "color"])) + geom_histogram())

<sql.ggplot.ggplot.ggplot at 0x7f72d6667400>

../_images/2f43987cb5b52f8313835e352bcba6f46b317fa7ff465153ab101173021cb63f.png

We can also plot histograms for a combination of categorical and numerical columns.

(ggplot("diamonds", aes(x=["color", "carat"])) + geom_histogram(bins=30))

<sql.ggplot.ggplot.ggplot at 0x7f72d6f9ba90>

../_images/9d628cd44b99e69815b92c8fdce2af8db753c25e8d83a4e814d7e5be3e03214e.png

Apply a custom color with color and fill

(
    ggplot("diamonds", aes(x="price", fill="green", color="white"))
    + geom_histogram(bins=10, fill="cut")
)

<sql.ggplot.ggplot.ggplot at 0x7f72d71e0b50>

../_images/6160e6f766f55501a9076fe6abd5c3e70bc3f0e736aacde9dc9d8bde5d44a0d5.png

If we map the fill attribute to a different variable such as cut, the bars will stack automatically. Each colored rectangle on the stacked bars will represent a unique combination of price and cut.

(ggplot("diamonds", aes(x="price")) + geom_histogram(bins=10, fill="cut"))

<sql.ggplot.ggplot.ggplot at 0x7f72d66647f0>

../_images/86122a05bbe1d910b3f4923a6d38537421405e95cc0f191c275d45ae529d9dd7.png

We can apply a different coloring using cmap

(
    ggplot("diamonds", aes(x="price"))
    + geom_histogram(bins=10, fill="cut", cmap="plasma")
)

<sql.ggplot.ggplot.ggplot at 0x7f72d6c4c2b0>

../_images/35ae0de6cb4dac6908d2aba0844aa2cbafa13149fb574d6797ed0debb423622d.png