The main objective of this document is to show how to access
transactional data in RDBMS (Relational Database Management
Systems) and store the star database generated from it in another
relational database. Access to the databases is done through the dm
package. The transformations are carried out using the
rolap
package.
Additionally, a multidimensional design case is solved using star databases with two granularities.
First, the initial transactional data is presented. The following shows the implementation of the star databases. Finally, it ends with the conclusions.
We will build on a dataset with transactional data from Czech debit card company specialising on payments at gas stations called CCS, publicly available in a RDBMS.
First, we make a connection by following the instructions they offer, as shown below1.
ccs_db <- RMariaDB::dbConnect(
RMariaDB::MariaDB(),
username = "guest",
password = "relational",
dbname = "ccs",
host = "relational.fit.cvut.cz"
)
Next, from the connection, we can obtain a dm
object,
taking the definition of the keys from the database.
Using the dm::dm_draw(view_type = "all")
function we can
display the database tables with all their fields.
We can access tables as if they were part of a list. If we access the transactions table we can see that it has no data, the data is in the transactions_1k table hat is not related to the other tables.
Therefore, what we do is generate a new dm
object by
selecting the tables that we are going to use and relating table
transactions_1k to the rest of the tables. These operations are
performed below.
ccs_sel_dm <-
ccs_dm[c('transactions_1k', 'customers', 'gasstations', 'products')] |>
dm::dm_add_fk(transactions_1k, CustomerID, customers) |>
dm::dm_add_fk(transactions_1k, GasStationID, gasstations) |>
dm::dm_add_fk(transactions_1k, ProductID, products)
The resulting database schema is shown below.
The starting point of the rolap
package to define the
transformations to obtain a star database is a flat table. To generate a
flat table in a tibble
from the database tables we can use
the functions offered by the dm
package, as shown
below.
transactions_ft <- ccs_sel_dm |>
dm::dm_flatten_to_tbl(transactions_1k, .recursive = TRUE) |>
dm::collect()
class(transactions_ft)
#> [1] "tbl_df" "tbl" "data.frame"
nrow(transactions_ft)
#> [1] 1000
dput(colnames(transactions_ft))
#> c("TransactionID", "Date", "Time", "CustomerID", "CardID", "GasStationID",
#> "ProductID", "Amount", "Price", "Segment.customers", "Currency",
#> "ChainID", "Country", "Segment.gasstations", "Description")
With this we already have in the transactions_ft
variable the flat table with the transactional data that we are going to
use. In the list of fields of this table, we can see that the fields
with the same name that were in several tables are now postfixed with
the name of the table they came from (Segment.customers and
Segment.gasstations).
Since we already have the result we need, we can disconnect from the database.
From the data in the flat table (variable
transactions_ft
), we are going to carry out a
multidimensional design.
We will start by exploring part of the data to get to know it better.
The transactions_1k table has the TransactionID field as its primary key. We check that this field is a surrogate key, autonumber, as shown below.
length(unique(transactions_ft$TransactionID))
#> [1] 1000
min(transactions_ft$TransactionID)
#> [1] 1
max(transactions_ft$TransactionID)
#> [1] 1000
As would be expected, a transaction would correspond to the payment of a customer, by means of a card, for one or several products at a gas station and at a specific moment.
We can verify that this is true by obtaining the number of distinct instances for the combinations of attributes of interest.
nrow(unique(transactions_ft[, c("Date", "Time", "CustomerID")]))
#> [1] 919
nrow(unique(transactions_ft[, c("Date", "Time", "CardID")]))
#> [1] 921
nrow(unique(transactions_ft[, c("Date", "Time", "CardID", "ProductID")]))
#> [1] 1000
In view of these results, we deduce that a customer has paid at the same time with more than one card: This has been possible because the time is given in minutes. By adding the product to the combination of attributes that includes card, it is observed that all the rows of the table are obtained.
In view of this, we consider that:
There is nothing remarkable in the rest of the data.
First, we are going to make a design with the finest granularity.
We are going to add an Hour field obtained from the Time field. We also add a processing date, assuming we know that transactions are processed 48 hours after they occur.
transactions_ft <- transactions_ft |>
dplyr::mutate(Hour = format(Time, format = "%H")) |>
dplyr::mutate(`Processing Date` = format(as.POSIXct(Date) +
lubridate::days(2), format = "%Y-%m-%d"))
We define the star schema by distributing the flat table attributes into facts and dimensions.
library(rolap)
s_finest <- rolap::star_schema() |>
rolap::define_facts(name = "Transaction Line",
measures = c("Amount", "Price")) |>
rolap::define_dimension(name = "Transaction",
attributes = c("Date", "Time", "CardID")) |>
rolap::define_dimension(name = "Who",
attributes = c("CardID", "CustomerID", "Segment.customers",
"Currency")) |>
rolap::define_dimension(name = "Where",
attributes = c("GasStationID", "ChainID", "Country",
"Segment.gasstations")) |>
rolap::define_dimension(name = "When",
attributes = c("Date", "Hour")) |>
rolap::define_dimension(name = "When Moment",
attributes = c("Time", "Hour")) |>
rolap::define_dimension(name = "When Processed",
attributes = c("Processing Date", "Hour")) |>
rolap::define_dimension(name = "What",
attributes = c("Description"))
The most remarkable thing about the definition is that the facts correspond to a transaction line, each line corresponding to a product in the transaction. On the other hand, each transaction is identified by the attributes Date, Time and CardID. Since these concepts are already represented in other dimensions, the surrogate key (foreign key in the facts) would be enough to represent the transaction, so Transaction would be a degenerate dimension (the dimension table is not needed, the foreign key in the facts is enough).
Subsequently, the dimensions could be enriched with other attributes available in the transactional system. In this case, we are going to leave them as they are.
From the star schema and the flat table, we obtain a star database with the tables of the dimensions and facts.
We can transform the database by renaming the measures of the facts and/or the attributes of the dimensions and also, if we are going to work in a RDBMS, transforming them to snake case style.
db_finest <- db_finest |>
rolap::snake_case() |>
rolap::set_attribute_names(
name = "who",
new = c(
"card",
"customer",
"segment",
"currency"
)
) |>
rolap::set_attribute_names(
name = "where",
new = c(
"gas_station",
"chain",
"country",
"segment"
)
) |>
rolap::set_attribute_names(
name = "what",
new = c(
"product"
)
)
For easier access to the result tables, we are going to transform
them into a list of tibble
objects and show their names and
number of instances.
db_tl <- db_finest |>
rolap::as_tibble_list()
names <- names(db_tl)
for (i in seq_along(db_tl)){
cat(sprintf("name: %s, %d rows\n", names[i], nrow(db_tl[[i]])))
}
#> name: transaction, 921 rows
#> name: who, 902 rows
#> name: where, 437 rows
#> name: when, 748 rows
#> name: when_moment, 599 rows
#> name: when_processed, 748 rows
#> name: what, 27 rows
#> name: transaction_line, 1000 rows
To represent the result graphically, we export the star database in
the form of a dm
object. In addition, we select the tables
of the model: All except the table of the degenerate dimension.
db_dm <- db_finest |>
rolap::as_dm_class(pk_facts = FALSE)
tables <- db_finest |>
rolap::get_table_names()
# Degenerate dimension
tables <- base::setdiff(tables, 'transaction')
db_finest_dm <- db_dm[tables]
Below is the database schema.
It can be seen that a new measure has been added to the fact table (nrow_agg), this measure indicates for each row the number of rows in the original table from which it was obtained. In the definition of the facts you can indicate a specific name for it. In this case, each row corresponds to a row in the original table, therefore, all the values must be 1, as can be verified below.
In the star database, we can define that any one of the dimensions is
the role-playing dimension and the other dimensions are various roles of
that dimension, the only requirement is that all dimensions must have
the same number of attributes. More details can be found in the vignette
titled Definition of Role-Playing and Role Dimensions,
vignette("v10-rpd")
.
For the case we are solving, the definition is shown below.
db_finest <- db_finest |>
rolap::role_playing_dimension(
rpd = "when",
roles = c("when_processed")
)
The result can be seen in the instances of the star database tables, below.
db_tl <- db_finest |>
rolap::as_tibble_list()
names <- names(db_tl)
for (i in seq_along(db_tl)){
cat(sprintf("name: %s, %d rows\n", names[i], nrow(db_tl[[i]])))
}
#> name: transaction, 921 rows
#> name: who, 902 rows
#> name: where, 437 rows
#> name: when, 1466 rows
#> name: when_moment, 599 rows
#> name: when_processed, 1466 rows
#> name: what, 27 rows
#> name: transaction_line, 1000 rows
We can see that the tables associated with the when dimension (with date and hour) contain the same number of instances, but a larger number than before since they include all the different instances of the two tables.
Once we have the star database tables in a dm
object,
exporting them to any RDBMS is immediate. In the example below, a local
SQLite database is used. dm
works with any RDBMS
supported by DBI
.
my_db <- DBI::dbConnect(RSQLite::SQLite())
my_db_finest_dm <- dm::copy_dm_to(my_db, db_finest_dm)
my_db_finest_dm
#> ── Table source ────────────────────────────────────────────────────────────────
#> src: sqlite 3.41.2 []
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `transaction_line`, `what`, `when`, `when_moment`, `when_processed`, … (7 total)
#> Columns: 31
#> Primary keys: 6
#> Foreign keys: 6
DBI::dbDisconnect(my_db)
Next we are going to perform a multidimensional design with a coarser granularity. To have somewhat different dates from the previous example, we are going to define a new date, assuming we know that transactions are paid 24 hours after they occur.
transactions_ft <- transactions_ft |>
dplyr::mutate(`Payment Date` = format(as.POSIXct(Date) +
lubridate::days(1), format = "%Y-%m-%d"))
To design the star schema, for each dimension we exclusively select the attributes that we consider necessary.
s_summary <- rolap::star_schema() |>
rolap::define_facts(name = "Transaction Summary",
measures = c("Amount", "Price"),
nrow_agg = "Transactions") |>
rolap::define_dimension(name = "Who Segment",
attributes = c("Segment.customers",
"Currency")) |>
rolap::define_dimension(name = "Where Chain",
attributes = c("ChainID", "Country",
"Segment.gasstations")) |>
rolap::define_dimension(name = "When",
attributes = c("Date", "Hour")) |>
rolap::define_dimension(name = "When Paid",
attributes = c("Payment Date", "Hour")) |>
rolap::define_dimension(name = "What",
attributes = c("Description"))
We have assumed that we are not interested in individual transactions but in a summary of them; neither the particular data of the clients, but only the associated segment and currency. We are not interested in gas stations either, but only the chain to which they belong and the rest of the dimension attributes. Finally, in this case, the dimension with the data of the processing date has been interpreted as payment date.
Each measure can be associated with a specific aggregation function, by default it is the sum which, in this case, can be considered adequate since adding to group the data the meaning of the measures is preserved (understanding Price as the total amount of the sale operation). Otherwise, it is advisable to define measures obtained from those available so that this principle is fulfilled.
Finally, indicate that this time a specific name has been assigned to the measure of the number of rows added: Transactions.
As before, from the star schema and the flat table, we obtain a star database with the tables of the dimensions and facts. We also transform the names of dimension attributes and define the role-playing and role dimensions.
db_summary <- rolap::star_database(s_summary, transactions_ft) |>
rolap::snake_case() |>
rolap::set_attribute_names(
name = "who_segment",
new = c(
"segment",
"currency"
)
) |>
rolap::set_attribute_names(
name = "where_chain",
new = c(
"chain",
"country",
"segment"
)
) |>
rolap::set_attribute_names(
name = "what",
new = c(
"product"
)
) |>
rolap::role_playing_dimension(
rpd = "when",
roles = c("when_paid")
)
Finally, we present the result obtained showing the names and number of rows of the tables. We check that the number of rows added matches the total number of rows in the initial flat table.
db_tl <- db_summary |>
rolap::as_tibble_list()
names <- names(db_tl)
for (i in seq_along(db_tl)){
cat(sprintf("name: %s, %d rows\n", names[i], nrow(db_tl[[i]])))
}
#> name: who_segment, 6 rows
#> name: where_chain, 33 rows
#> name: when, 1366 rows
#> name: when_paid, 1366 rows
#> name: what, 27 rows
#> name: transaction_summary, 995 rows
sum(db_tl[['transaction_summary']]$transactions)
#> [1] 1000
If we compare it with the previous example, we verify that there are fewer instances in the when dimension tables because in this case the difference between dates was 24 hours (before it was 48 hours).
We also show its graphic representation.
db_summary_dm <- db_summary |>
rolap::as_dm_class(pk_facts = FALSE)
db_summary_dm |>
dm::dm_draw(view_type = "all")
Tables can be stored directly in any RDBMS, just as we have done before.
We can work with the star databases separately or in an integrated way by defining a constellation from them. On the constellation we can carry out the same operations that we perform on a star database.
ct <- rolap::constellation("CSS", db_finest, db_summary)
db_tl <- ct |>
rolap::as_tibble_list()
names <- names(db_tl)
for (i in seq_along(db_tl)){
cat(sprintf("name: %s, %d rows\n", names[i], nrow(db_tl[[i]])))
}
#> name: transaction, 921 rows
#> name: what, 27 rows
#> name: when, 1965 rows
#> name: when_moment, 599 rows
#> name: when_paid, 1965 rows
#> name: when_processed, 1965 rows
#> name: where, 437 rows
#> name: where_chain, 33 rows
#> name: who, 902 rows
#> name: who_segment, 6 rows
#> name: transaction_line, 1000 rows
#> name: transaction_summary, 995 rows
ct_dm_all <- ct |>
rolap::as_dm_class(pk_facts = FALSE)
tables <- ct |>
rolap::get_table_names()
# Degenerate dimension
tables <- base::setdiff(tables, 'transaction')
ct_dm <-
ct_dm_all[tables]
my_db <- DBI::dbConnect(RSQLite::SQLite())
my_ct_dm <- dm::copy_dm_to(my_db, ct_dm)
my_ct_dm
#> ── Table source ────────────────────────────────────────────────────────────────
#> src: sqlite 3.41.2 []
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `transaction_line`, `transaction_summary`, `what`, `when`, `when_moment`, … (11 total)
#> Columns: 49
#> Primary keys: 9
#> Foreign keys: 11
DBI::dbDisconnect(my_db)
ct_dm |>
dm::dm_draw(view_type = "all")
If we look at the when dimension tables (with date and hour), it can be seen that they all have the same number of instances: as there was a table in common between both star databases, a conformed table has been generated from them and the information about role-playing and role dimensions has been automatically updated with the new situation for the joint schema, as shown below .
In this document we have started from transactional data in a RDBMS
and, through the dm
and rolap
packages, we
have obtained two star databases and/or a constellation also in RDBMS.
The dm
package facilitates access to the RDBMS, while the
rolap
package supports the definition of the necessary
transformations in this context.
The initial data schema is very close to a star schema although, as
explained in the design section, it is conceptually different
(particularly the concepts of transaction and transaction line). The
simple structure of this schema has only facilitated the process of
obtaining the starting flat table to work with the rolap
package.
In the R code that is included, for each function it is indicated by the prefix the package where it is defined.↩︎