SoCal RUG x UCI Hackathon - 2024.04.27
dplyr
dplyr is a data manipulation package for R that offers SQL-like functions such as select(), filter(), summarize(), and more to simplify data tasks.
pandas
pandas is a data analysis library in Python that supports the same kind of selecting, filtering, and manipulation tasks.
we need data!
palmerpenguins | Package available on CRAN for R and PyPI for Python
Install the palmerpenguins package and load the curated data set.
glimpse() at the data
Rows: 344
at the DataRows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
dplyr | Learn more about dplyr
the select() function
select() lets us easily subset the data.
dplyr | Learn more about select()
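For the pandas side of the comparison, the same subsetting is plain column selection; a minimal sketch on a toy frame (hypothetical values, not the real penguins data):

```python
import pandas as pd

# Toy stand-in for the penguins data (hypothetical values)
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "bill_length_mm": [39.1, 47.5],
    "body_mass_g": [3750, 5200],
})

# dplyr: penguins |> select(species, body_mass_g)
subset = df[["species", "body_mass_g"]]
print(subset.columns.tolist())  # → ['species', 'body_mass_g']
```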
the filter() function
Rows: 66
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.5, 36.7, 38.9, 36.6, 38.7, 34.4, 37.8, 35.9, 35.3…
$ bill_depth_mm <dbl> 17.4, 19.3, 17.8, 17.8, 19.0, 18.4, 18.3, 19.2, 18.9…
$ flipper_length_mm <int> 186, 193, 181, 185, 195, 184, 174, 189, 187, 172, 17…
$ body_mass_g <int> 3800, 3450, 3625, 3700, 3450, 3325, 3400, 3800, 3800…
$ sex <fct> female, female, female, female, female, female, fema…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
dplyr | Learn more about filter()
The filter() function lets us keep rows where the provided expression(s) are TRUE.
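In pandas, the equivalent of filter() is boolean masking; a minimal sketch on a toy frame (hypothetical values):

```python
import pandas as pd

# Toy stand-in for the penguins data (hypothetical values)
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "sex": ["female", "male", "female"],
    "body_mass_g": [3800, 3750, 5200],
})

# dplyr: penguins |> filter(sex == "female")
females = df[df["sex"] == "female"]
print(len(females))  # → 2
```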
the arrange() function
Sort by bill_length_mm ascending and by year descending.
dplyr | Learn more about arrange()
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Dream, Dream, Torgersen, Dream, Torgersen, Torgersen…
$ bill_length_mm <dbl> 32.1, 33.1, 33.5, 34.0, 34.1, 34.4, 34.5, 34.6, 34.6…
$ bill_depth_mm <dbl> 15.5, 16.1, 19.0, 17.1, 18.1, 18.4, 18.1, 17.2, 21.1…
$ flipper_length_mm <int> 188, 178, 190, 185, 193, 184, 187, 189, 198, 192, 19…
$ body_mass_g <int> 3050, 2900, 3600, 3400, 3475, 3325, 2900, 3200, 4400…
$ sex <fct> female, female, female, female, NA, female, female, …
$ year <int> 2009, 2008, 2008, 2008, 2007, 2007, 2008, 2008, 2007…
The arrange() function sorts rows in order of one or more columns; wrap a column in desc() to sort in descending order.
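The pandas counterpart is sort_values(), where a per-column ascending list plays the role of desc(); a sketch on a toy frame (hypothetical values):

```python
import pandas as pd

# Toy stand-in for the penguins data (hypothetical values)
df = pd.DataFrame({
    "bill_length_mm": [39.1, 32.1, 32.1],
    "year": [2007, 2008, 2009],
})

# dplyr: penguins |> arrange(bill_length_mm, desc(year))
out = df.sort_values(["bill_length_mm", "year"], ascending=[True, False])
print(out["year"].tolist())  # → [2009, 2008, 2007]
```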
mutate()
Rows: 344
Columns: 9
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ body_mass_kg <dbl> 3.750, 3.800, 3.250, NA, 3.450, 3.650, 3.625, 4.675,…
dplyr | Learn more about mutate()
mutate() is similar to base R’s transform(), with the added benefit of being able to reference columns you’ve just created.
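pandas' assign() behaves much the same way: when a value is a lambda, it is evaluated against the frame built so far, so a later keyword can reference a column created just before it. A sketch (hypothetical values; body_mass_lb is my own illustrative column):

```python
import pandas as pd

# Toy stand-in for the penguins data (hypothetical values)
df = pd.DataFrame({"body_mass_g": [3750, 3800]})

# dplyr: mutate(body_mass_kg = body_mass_g / 1000, ...)
# Like mutate(), the second lambda can reference the column
# created by the first one in the same call
out = df.assign(
    body_mass_kg=lambda d: d["body_mass_g"] / 1000,
    body_mass_lb=lambda d: d["body_mass_kg"] * 2.20462,
)
print(out["body_mass_kg"].tolist())  # → [3.75, 3.8]
```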
group_by() variables and summarize() results
dplyr | Learn more about summarize()
summarize() creates a new data frame.
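The pandas analogue of group_by() + summarize() is groupby() + named aggregation, which likewise produces a new frame; a sketch (hypothetical values):

```python
import pandas as pd

# Toy stand-in for the penguins data (hypothetical values)
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3700, 3800, 5200],
})

# dplyr: group_by(species) |> summarize(mean_mass_g = mean(body_mass_g))
out = df.groupby("species", as_index=False).agg(mean_mass_g=("body_mass_g", "mean"))
```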
The base R pipe (|>) or magrittr pipe (%>%) can be used to chain dplyr functions together.
penguins_mod <- penguins_df |>
select(sex, bill_length_mm, body_mass_g) |>
filter(sex == "female",
bill_length_mm > 30) |>
group_by(sex) |>
summarize(mean_mass_g = mean(body_mass_g, na.rm = TRUE))
glimpse(penguins_mod)
Rows: 1
Columns: 2
$ sex <fct> female
$ mean_mass_g <dbl> 3862.273
dplyr | Learn more about piping
In pandas, the dot (.) is used to chain functions together.
Arrow | Learn more at https://arrow.apache.org/
The .parquet dataset was partitioned by year and month (120 files).
# Install full Arrow with S3 and GCS support:
# Sys.setenv("LIBARROW_MINIMAL" = FALSE)
# Sys.setenv("ARROW_S3" = "ON")
# Sys.setenv("ARROW_GCS" = "ON")
# install.packages("arrow")
library(here)
library(arrow)
library(dplyr)
library(ggplot2)
library(bench)
# NYC Taxi Data download (40 GB)
data_path <- here::here("data/nyc-taxi")
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
filter(year %in% 2012:2021) |>
write_dataset(data_path, partitioning = c("year", "month"))
# 1. Open Arrow connection to dataset (40 GB)
nyc_taxi <- open_dataset(here::here("data/nyc-taxi"))
# 2. Benchmark dplyr pipeline
bnch <- bench::mark(
min_iterations = 10,
arrow = nyc_taxi |>
dplyr::group_by(year) |>
dplyr::summarise(all_trips = n(),
shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>
dplyr::mutate(pct_shared = shared_trips / all_trips * 100) |>
dplyr::collect()
)
autoplot(bnch)
Arrow | Arrow + dplyr compatibility
Example: the stringr function stringr::str_replace_na() is not supported by Arrow.
but wait! problem solved
Register str_replace_na() with the Arrow kernel.
Arrow | Learn more about registering Arrow User Defined Functions (“UDFs”)
Call arrow::schema() on your Arrow table to review the field name / data type pairs. For the vendor_name field, I know I’ll be working with an Arrow string() data type.
Use register_scalar_function() and set auto_convert = TRUE.
Eevee can’t believe it… “Wooooooooooow, Javi!”
ADBC: Arrow Database Connectivity
ADBC (Arrow Database Connectivity) | Learn more at https://arrow.apache.org/docs/format/ADBC.html
Arrow Flight SQL
Arrow Flight SQL | Learn more at https://arrow.apache.org/docs/format/FlightSql.html
duckplyr by DuckDB Labs | Learn more at https://duckdblabs.com/
The duckdb package for Python offers a state-of-the-art optimizer that pushes down filters and projections directly into Arrow scans.
DuckDB Labs | Learn how DuckDB quacks Arrow
import os
# Set environment variables
os.environ["LIBARROW_MINIMAL"] = "FALSE"
os.environ["ARROW_S3"] = "ON"
os.environ["ARROW_GCS"] = "ON"
!pip install pyarrow
!pip install duckdb
import pyarrow as pa
import pyarrow.dataset as ds
import duckdb
dataset = ds.dataset("s3://voltrondata-labs-datasets/nyc-taxi", format="parquet")
filtered_data = dataset.to_table(filter=ds.field("year").isin(list(range(2012, 2022))))
ds.write_dataset(filtered_data, "<your local write path>", format="parquet", partitioning=["year", "month"])
library(duckdb)
library(arrow)
library(dplyr)
ds <- arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
ds |>
filter(year > 2014 & passenger_count > 0 &
trip_distance > 0.25 & fare_amount > 0) |>
# Pass off to DuckDB
to_duckdb() |>
group_by(passenger_count) |>
mutate(tip_pct = tip_amount / fare_amount) |>
summarise(fare_amount = mean(fare_amount, na.rm = TRUE),
tip_amount = mean(tip_amount, na.rm = TRUE),
tip_pct = mean(tip_pct, na.rm = TRUE)) |>
arrange(passenger_count) |>
collect()
import duckdb
import pyarrow as pa
import pyarrow.dataset as ds
# Open dataset using year,month folder partition
nyc = ds.dataset('nyc-taxi/', partitioning=["year", "month"])
# Transform the nyc dataset into a DuckDB relation
nyc = duckdb.arrow(nyc)
# Run the same query again
nyc.filter("year > 2014 AND passenger_count > 0 AND trip_distance > 0.25 AND fare_amount > 0") \
   .aggregate("AVG(fare_amount), AVG(tip_amount), AVG(tip_amount / fare_amount) AS tip_pct", "passenger_count") \
   .arrow()
# Read the dataset, partitioned in year/month folders
nyc_dataset <- open_dataset("nyc-taxi/", partitioning = c("year", "month"))
# Get a database connection
con <- dbConnect(duckdb::duckdb())
# We can use the same function as before to register our arrow dataset
duckdb::duckdb_register_arrow(con, "nyc", nyc_dataset)
res <- dbSendQuery(con, "SELECT * FROM nyc", arrow = TRUE)
# DuckDB's queries can now produce a Record Batch Reader
record_batch_reader <- duckdb::duckdb_fetch_record_batch(res)
# Which means we can stream the whole query per batch.
# This retrieves the first batch
cur_batch <- record_batch_reader$read_next_batch()
# Read the dataset, partitioned in year/month folders
nyc_dataset = ds.dataset('nyc-taxi/', partitioning=["year", "month"])
# Get a database connection
con = duckdb.connect()
query = con.execute("SELECT * FROM nyc_dataset")
# DuckDB's queries can now produce a Record Batch Reader
record_batch_reader = query.fetch_record_batch()
# Which means we can stream the whole query per batch.
# This retrieves the first batch
chunk = record_batch_reader.read_next_batch()
DuckDB Labs | Learn how DuckDB quacks Arrow
# DuckDB via Python
# Open dataset using year,month folder partition
nyc = ds.dataset('nyc-taxi/', partitioning=["year", "month"])
# Get database connection
con = duckdb.connect()
# Run query that selects part of the data
query = con.execute("SELECT total_amount, passenger_count,year FROM nyc where total_amount > 100 and year > 2014")
# Create Record Batch Reader from Query Result.
# "fetch_record_batch()" also accepts an extra parameter related to the desired produced chunk size.
record_batch_reader = query.fetch_record_batch()
# Stream over all batch chunks; the reader is iterable and yields
# pyarrow.RecordBatch objects until the query result is exhausted
for chunk in record_batch_reader:
    pass  # process each batch here
# We must exclude one of the columns of the NYC dataset due to an unimplemented cast in Arrow
working_columns = ["vendor_id","pickup_at","dropoff_at","passenger_count","trip_distance","pickup_longitude",
"pickup_latitude","store_and_fwd_flag","dropoff_longitude","dropoff_latitude","payment_type",
"fare_amount","extra","mta_tax","tip_amount","tolls_amount","total_amount","year", "month"]
# Open dataset using year,month folder partition
nyc_dataset = ds.dataset('nyc-taxi/', partitioning=["year", "month"])
# Generate a scanner to skip problematic column
dataset_scanner = nyc_dataset.scanner(columns=working_columns)
# Materialize dataset to an Arrow Table
nyc_table = dataset_scanner.to_table()
# Generate DataFrame from Arrow Table
nyc_df = nyc_table.to_pandas()
# Apply Filter
filtered_df = nyc_df[
(nyc_df.total_amount > 100) &
(nyc_df.year > 2014)]
# Apply Projection
res = filtered_df[["total_amount", "passenger_count","year"]]
# Transform Result back to an Arrow Table
new_table = pa.Table.from_pandas(res)
duckplyr, from DuckDB Labs, offers 1:1 compatibility with dplyr functions, but there are some caveats: use .by in dplyr::summarize(), as dplyr::group_by() will not be supported by the developers.
duckplyr | Learn more at https://duckdblabs.com/
One dplyr syntax works across dplyr, arrow, duckdb, and duckplyr.
DuckDB Labs | Learn more at https://duckdblabs.com/
questions?
posit::conf(2023) | Big Data in R with Arrow