Posit, formerly RStudio, hosts an annual conference including workshops, lunch-and-learns, and presentations touching on all things data science.
R represented the majority of the presentations, but there was more Python content than in prior years.
Posit | Learn more at https://posit.co/
Conference Highlights
Just to clarify… That’s Quarto… Q-U-A-R-T-O, not #4 in Spanish (or “cuatro”).
These web slides were built with Quarto and this section was copied almost entirely from Quarto’s Reveal.js demo documentation
The next few slides cover what you can do with Quarto and Reveal.js, including syntax highlighting, code animations, line highlighting, the codewindow extension, and countdown timers.
Quarto | Learn more on Quarto’s Reveal.js demo presentation
# Define a server for the Shiny app
function(input, output) {
  # Fill in the spot we created for a plot
  output$phonePlot <- renderPlot({
    # Render a barplot
  })
}
Quarto | Syntax Highlighting
# Define a server for the Shiny app
function(input, output) {
  # Fill in the spot we created for a plot
  output$phonePlot <- renderPlot({
    # Render a barplot
    barplot(WorldPhones[, input$region] * 1000,
            main = input$region,
            ylab = "Number of Telephones",
            xlab = "Year")
  })
}
Quarto | Code Animations
import numpy as np
import matplotlib.pyplot as plt
r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r
fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, r)
ax.set_rticks([0.5, 1, 1.5, 2])
ax.grid(True)
plt.show()
Quarto | Line Highlighting
index.qmd
---
title: "HTML document"
format: html
---
## Quarto
Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see <https://quarto.org>.
## Running Code
When you click the **Render** button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
::: {.cell}
:::
You can add options to executable code like this:
::: {.cell}
:::
The `echo: false` option disables the printing of code (only output is displayed).
index.html
Quarto | codewindow Reveal.js extension & example code above from Emil Hvitfeldt’s Quarto Theming Talk at Posit Conf
The countdown package is great for training purposes or when you’ve got a time limit
00:10
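A minimal sketch of how the timer above can be embedded with the countdown R package (the 10-second duration matches the display; placement inside an R chunk on a Quarto slide is assumed):

library(countdown)

# Render a 10-second on-slide timer (click it to start/stop)
countdown(minutes = 0, seconds = 10)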
Quarto | Learn more about the countdown package
Arrow | Learn more at https://arrow.apache.org/
The .parquet dataset was partitioned by Year and Month (120 files)
Arrow | Learn more at https://arrow.apache.org/
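For intuition, the hive-style layout written by write_dataset() in the script below looks roughly like this (10 years × 12 months = 120 leaf files; exact file names are illustrative):

data/nyc-taxi/year=2012/month=1/part-0.parquet
data/nyc-taxi/year=2012/month=2/part-0.parquet
...
data/nyc-taxi/year=2021/month=12/part-0.parquet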
benchmark_arrow_read.R
# 0. Load Libraries to Download & Benchmark Data
library(here)
library(arrow)
library(dplyr)
library(ggplot2)
library(bench)
# 1. NYC Taxi Data download (40 GB)
data_path <- here::here("data/nyc-taxi")
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
filter(year %in% 2012:2021) |>
write_dataset(data_path, partitioning = c("year", "month"))
# 2. Benchmark Read Times
bnch <- bench::mark(
min_iterations = 1000,
arrow = open_dataset(here::here("data/nyc-taxi"))
)
autoplot(bnch)
Arrow | Learn more at https://arrow.apache.org/
benchmark_arrow_dplyr.R
# 0. Load libraries (this script runs standalone)
library(here)
library(arrow)
library(dplyr)
library(bench)
library(ggplot2)
# 1. Open Arrow connection to dataset (40 GB)
nyc_taxi <- open_dataset(here::here("data/nyc-taxi"))
# 2. Benchmark dplyr pipeline
bnch <- bench::mark(
  min_iterations = 10,
  arrow = nyc_taxi |>
    dplyr::group_by(year) |>
    dplyr::summarise(all_trips = n(),
                     shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>
    dplyr::mutate(pct_shared = shared_trips / all_trips * 100) |>
    dplyr::collect()
)
autoplot(bnch)
Arrow | Arrow + dplyr compatibility
Example: the stringr::str_replace_na() function is not supported by Arrow
Arrow | Arrow + dplyr compatibility
but wait!
problem solved: we can register str_replace_na() with the Arrow kernel
Arrow | Learn more about registering Arrow User Defined Functions (“UDFs”)
Use arrow::schema() on your Arrow table to review the field name / data type pairs
For the vendor_name field, I know I’ll be working with an Arrow string() data type
Arrow | Learn more about registering Arrow User Defined Functions (“UDFs”)
Register the function with register_scalar_function(), setting auto_convert = TRUE so Arrow inputs are converted to R vectors before the function runs (see the sketch below)
Arrow | Learn more about registering Arrow User Defined Functions (“UDFs”)
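A minimal sketch of that registration, following the pattern in the arrow R documentation; the UDF name replace_arrow_nas and the "No vendor" replacement value are illustrative, not from the talk:

library(arrow)
library(dplyr)
library(stringr)

# Register an R function as a scalar UDF that Arrow's engine can execute
register_scalar_function(
  name = "replace_arrow_nas",
  # First argument is the kernel context; the rest mirror in_type's fields
  function(context, x, replacement) {
    stringr::str_replace_na(x, replacement)
  },
  in_type = schema(x = string(), replacement = string()),
  out_type = string(),
  auto_convert = TRUE  # hand the UDF R vectors rather than Arrow Arrays
)

# Illustrative use on the vendor_name field discussed above
open_dataset(here::here("data/nyc-taxi")) |>
  mutate(vendor_name = replace_arrow_nas(vendor_name, "No vendor")) |>
  head() |>
  collect()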
Eevee can’t believe it… “Wooooooooooow, Javi!”
ADBC: Arrow Database Connectivity
ADBC (Arrow Database Connectivity) | Learn more at https://arrow.apache.org/docs/format/ADBC.html
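The slide itself is just a pointer, but here is a minimal sketch of the ADBC workflow in R, assuming the adbcdrivermanager and adbcsqlite packages (an in-memory SQLite database stands in for a real backend):

library(adbcdrivermanager)

# Initialize the driver, database, and connection
db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = ":memory:")
con <- adbc_connection_init(db)

# Write a data frame over ADBC, then read a query result back as a data frame
mtcars |> write_adbc(con, "mtcars")
read_adbc(con, "SELECT mpg, cyl FROM mtcars WHERE mpg > 25") |>
  as.data.frame()

# Release resources when finished
adbc_connection_release(con)
adbc_database_release(db)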
Arrow Flight SQL
Arrow Flight SQL | Learn more at https://arrow.apache.org/docs/format/FlightSql.html
duckplyr by DuckDB Labs | Learn more at https://duckdblabs.com/
DuckDB is available as the duckdb package for Python and offers a state-of-the-art optimizer that pushes down filters and projections directly into Arrow scans
DuckDB Labs | Learn how DuckDB quacks Arrow
library(duckdb)
library(arrow)
library(dplyr)
ds <- arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
ds |>
  filter(year > 2014 & passenger_count > 0 &
           trip_distance > 0.25 & fare_amount > 0) |>
  # Pass off to DuckDB
  to_duckdb() |>
  group_by(passenger_count) |>
  mutate(tip_pct = tip_amount / fare_amount) |>
  summarise(fare_amount = mean(fare_amount, na.rm = TRUE),
            tip_amount = mean(tip_amount, na.rm = TRUE),
            tip_pct = mean(tip_pct, na.rm = TRUE)) |>
  arrange(passenger_count) |>
  collect()
import duckdb
import pyarrow as pa
import pyarrow.dataset as ds
# Open dataset using year,month folder partition
nyc = ds.dataset('nyc-taxi/', partitioning=["year", "month"])
# We transform the nyc dataset into a DuckDB relation
nyc = duckdb.arrow(nyc)
# Run the same query again; parentheses make the method chain one expression
result = (
    nyc.filter("year > 2014 & passenger_count > 0 & trip_distance > 0.25 & fare_amount > 0")
       .aggregate("SELECT AVG(fare_amount), AVG(tip_amount), AVG(tip_amount / fare_amount) as tip_pct",
                  "passenger_count")
       .arrow()
)
DuckDB Labs | Learn how DuckDB quacks Arrow
# Read the dataset, partitioned into year/month folders
nyc_dataset <- open_dataset("nyc-taxi/", partitioning = c("year", "month"))
# Get a database connection
con <- dbConnect(duckdb::duckdb())
# We can use the same function as before to register our arrow dataset
duckdb::duckdb_register_arrow(con, "nyc", nyc_dataset)
res <- dbSendQuery(con, "SELECT * FROM nyc", arrow = TRUE)
# DuckDB's queries can now produce a Record Batch Reader
record_batch_reader <- duckdb::duckdb_fetch_record_batch(res)
# Which means we can stream the whole query result batch by batch.
# This retrieves the first batch
cur_batch <- record_batch_reader$read_next_batch()
# Read the dataset, partitioned into year/month folders
nyc_dataset = ds.dataset('nyc-taxi/', partitioning=["year", "month"])
# Get a database connection
con = duckdb.connect()
query = con.execute("SELECT * FROM nyc_dataset")
# DuckDB's queries can now produce a Record Batch Reader
record_batch_reader = query.fetch_record_batch()
# Which means we can stream the whole query result batch by batch.
# This retrieves the first batch
chunk = record_batch_reader.read_next_batch()
DuckDB Labs | Learn how DuckDB quacks Arrow
# DuckDB via Python
# Open dataset using year,month folder partition
nyc = ds.dataset('nyc-taxi/', partitioning=["year", "month"])
# Get database connection
con = duckdb.connect()
# Run query that selects part of the data
query = con.execute("SELECT total_amount, passenger_count,year FROM nyc where total_amount > 100 and year > 2014")
# Create Record Batch Reader from Query Result.
# "fetch_record_batch()" also accepts an extra parameter related to the desired produced chunk size.
record_batch_reader = query.fetch_record_batch()
# Retrieve all batch chunks
chunk = record_batch_reader.read_next_batch()
while len(chunk) > 0:
    chunk = record_batch_reader.read_next_batch()
# We must exclude one of the columns of the NYC dataset due to an unimplemented cast in Arrow
working_columns = ["vendor_id", "pickup_at", "dropoff_at", "passenger_count", "trip_distance",
                   "pickup_longitude", "pickup_latitude", "store_and_fwd_flag",
                   "dropoff_longitude", "dropoff_latitude", "payment_type", "fare_amount",
                   "extra", "mta_tax", "tip_amount", "tolls_amount", "total_amount",
                   "year", "month"]
# Open dataset using year,month folder partition
nyc_dataset = ds.dataset('nyc-taxi/', partitioning=["year", "month"])
# Generate a scanner to skip the problematic column
dataset_scanner = nyc_dataset.scanner(columns=working_columns)
# Materialize dataset to an Arrow Table
nyc_table = dataset_scanner.to_table()
# Generate DataFrame from Arrow Table
nyc_df = nyc_table.to_pandas()
# Apply filter
filtered_df = nyc_df[
    (nyc_df.total_amount > 100) &
    (nyc_df.year > 2014)]
# Apply projection
res = filtered_df[["total_amount", "passenger_count", "year"]]
# Transform result back to an Arrow Table
new_table = pa.Table.from_pandas(res)
DuckDB Labs | Learn how DuckDB quacks Arrow
duckplyr, from DuckDB Labs, offers 1:1 compatibility with dplyr functions, but there are some caveats: use .by in dplyr::summarize(), as dplyr::group_by() will not be supported by the developers
duckplyr | Learn more at https://duckdblabs.com/
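A minimal sketch of the .by style; the as_duckplyr_df() constructor and the toy data frame are assumptions for illustration, not from the talk:

library(duckplyr)

# Toy data standing in for the taxi dataset
trips <- as_duckplyr_df(data.frame(
  year = c(2019, 2019, 2020),
  passenger_count = c(1, 3, 2)
))

# Per-group summary with .by instead of group_by()
trips |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE),
    .by = year
  )

If duckplyr cannot translate part of a pipeline to DuckDB, it falls back to standard dplyr, so the code still runs.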
The same dplyr syntax works across dplyr, arrow, duckdb, and duckplyr
DuckDB Labs | Learn more at https://duckdblabs.com/
tidymodels | Learn more at https://www.tidymodels.org/
The usemodels package generates easy-to-use boilerplate code for modeling. usemodels::use_glmnet() generates code that eases the setup of the glmnet algorithm, and others, within the tidymodels framework
usemodels | Learn more at https://usemodels.tidymodels.org/
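A sketch of the call that prints scaffolding like the block below; the palmerpenguins data is assumed here, as in the usemodels documentation:

library(usemodels)
library(palmerpenguins)

# Prints recipe, model spec, workflow, and tuning-grid scaffolding to the console,
# ready to copy into a script and edit
use_glmnet(body_mass_g ~ ., data = penguins)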
glmnet_recipe <- recipe(formula = body_mass_g ~ ., data = penguins) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

glmnet_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_mode("regression") |> # body_mass_g is numeric, so a regression spec
  set_engine("glmnet")

glmnet_workflow <- workflow() |>
  add_recipe(glmnet_recipe) |>
  add_model(glmnet_spec)

glmnet_grid <- tidyr::crossing(penalty = 10^seq(-6, -1, length.out = 20),
                               mixture = c(0.05, 0.2, 0.4, 0.6, 0.8, 1))

glmnet_tune <-
  tune_grid(glmnet_workflow,
            resamples = stop("add your rsample object"),
            grid = glmnet_grid)
usemodels | Learn more at https://usemodels.tidymodels.org/
usemodels isn’t new but is still in early development; it provides several usemodels::use_*() variants
usemodels | Learn more at https://usemodels.tidymodels.org/
The main two strategies for optimization are:
Grid search 💠 which tests a pre-defined set of candidate values
Iterative search 🌀 which suggests/estimates new values of candidate parameters to evaluate
Slide credit | Posit, Advanced Tidymodels Workshop
Slide & illustration credit | Posit, Advanced Tidymodels Workshop
Slide & illustration credit | Posit, Advanced Tidymodels Workshop
Both search strategies are implemented in the tune package
Slide & illustration credit | Posit, Advanced Tidymodels Workshop
We can make a “meta-model” with a small set of historical performance results
Gaussian Processes (GP) models are a good choice to model performance
It is a Bayesian model, so we are using Bayesian Optimization (BO)
The GP model can take candidate tuning parameter combinations as inputs and make predictions for performance (e.g. MAE)
Slide credit | Posit, Advanced Tidymodels Workshop
We’ll use an acquisition function to select a new candidate
The most popular method appears to be Expected Improvement (EI) above the current best results
We would probably pick the point with the largest EI as the next point to iterate
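For reference, the standard closed form of EI (general Bayesian optimization math, not specific to the tune package): with GP posterior mean $\mu(x)$, standard deviation $\sigma(x)$, and current best (smallest) observed value $y^*$, minimizing a metric like MAE gives

$$\mathrm{EI}(x) = \big(y^* - \mu(x)\big)\,\Phi(z) + \sigma(x)\,\phi(z), \qquad z = \frac{y^* - \mu(x)}{\sigma(x)}$$

where $\Phi$ and $\phi$ are the standard normal CDF and density; EI grows when the mean prediction beats $y^*$ or when uncertainty is high.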
Slide credit | Posit, Advanced Tidymodels Workshop
Once we pick the candidate point, we measure performance for it (e.g. resampling)
Another GP is fit, EI is recomputed, and so on
We stop when we have completed the allowed number of iterations or if we don’t see any improvement after a pre-set number of attempts
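In tune, that early-stopping rule is exposed through control_bayes(); a minimal sketch (the values shown are illustrative, not from the workshop):

library(tune)

# Stop if 10 consecutive iterations bring no improvement, even before iter runs out
ctrl <- control_bayes(no_improve = 10, verbose = TRUE)
# later: tune_bayes(..., control = ctrl)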
Slide credit | Posit, Advanced Tidymodels Workshop
Slide & illustration credit | Posit, Advanced Tidymodels Workshop
We’ll use a function called tune_bayes() that has very similar syntax to tune_grid()
It has an additional initial argument for the initial set of performance estimates and parameter combinations for the GP model
initial can be the results of another tune_*() function or an integer (in which case tune_grid() is used under the hood)
Max Kuhn suggests at least the # of tuning parameters plus two as the initial grid
tune | Learn more about tune, a tidymodels package
reg_metrics <- metric_set(mae, rsq)
set.seed(12)
init_res <-
  lgbm_wflow %>%
  tune_grid(
    resamples = hotel_rs,
    grid = nrow(lgbm_param) + 2, # lgbm_param <- lgbm_wflow |> extract_parameter_set_dials()
    param_info = lgbm_param,
    metrics = reg_metrics
  )
show_best(init_res, metric = "mae")
#> # A tibble: 5 × 11
#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config
#> <int> <int> <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 390 10 0.0139 13 62 mae standard 11.3 10 0.202 Preprocessor1_Model1
#> 2 718 31 0.00112 72 25 mae standard 29.0 10 0.335 Preprocessor4_Model1
#> 3 1236 22 0.0000261 11 17 mae standard 51.8 10 0.416 Preprocessor7_Model1
#> 4 1044 25 0.00000832 34 12 mae standard 52.8 10 0.424 Preprocessor5_Model1
#> 5 1599 7 0.0000000402 254 179 mae standard 53.2 10 0.427 Preprocessor6_Model1
tune | Learn more about tune, a tidymodels package
set.seed(15)
lgbm_bayes_res <-
  lgbm_wflow %>%
  tune_bayes(
    resamples = hotel_rs,
    initial = init_res, # <- initial results
    iter = 20,
    param_info = lgbm_param, # lgbm_param <- lgbm_wflow |> extract_parameter_set_dials()
    metrics = reg_metrics
  )
show_best(lgbm_bayes_res, metric = "mae")
#> # A tibble: 5 × 12
#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config .iter
#> <int> <int> <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr> <int>
#> 1 1769 2 0.0299 114 245 mae standard 9.41 10 0.160 Iter13 13
#> 2 1969 3 0.0270 240 99 mae standard 9.49 10 0.189 Iter11 11
#> 3 1780 5 0.0403 27 78 mae standard 9.54 10 0.164 Iter17 17
#> 4 1454 3 0.0414 114 10 mae standard 9.55 10 0.144 Iter10 10
#> 5 1253 2 0.0312 131 207 mae standard 9.56 10 0.145 Iter19 19
tune | Learn more about tune, a tidymodels package
posit::conf(2023) | Watch keynote presentations and review talk materials