Star databases and RDBMS through the dm and rolap packages

Introduction

The main objective of this document is to show how to access transactional data in RDBMS (Relational Database Management Systems) and store the star database generated from it in another relational database. Access to the databases is done through the dm package. The transformations are carried out using the rolap package.

Additionally, a multidimensional design case is solved using star databases with two granularities.

First, the initial transactional data is presented. The following shows the implementation of the star databases. Finally, it ends with the conclusions.

Transactional data

We will build on a dataset with transactional data from Czech debit card company specialising on payments at gas stations called CCS, publicly available in a RDBMS.

First, we make a connection by following the instructions they offer, as shown below1.

ccs_db <- RMariaDB::dbConnect(
  RMariaDB::MariaDB(),
  username = "guest",
  password = "relational",
  dbname = "ccs",
  host = "relational.fit.cvut.cz"
)

Next, from the connection, we can obtain a dm object, taking the definition of the keys from the database.

ccs_dm <- dm::dm_from_con(ccs_db, learn_keys = TRUE)

Using the dm::dm_draw(view_type = "all") function we can display the database tables with all their fields.

We can access tables as if they were part of a list. If we access the transactions table we can see that it has no data, the data is in the transactions_1k table hat is not related to the other tables.

Therefore, what we do is generate a new dm object by selecting the tables that we are going to use and relating table transactions_1k to the rest of the tables. These operations are performed below.

ccs_sel_dm <-
  ccs_dm[c('transactions_1k', 'customers', 'gasstations', 'products')] |>
  dm::dm_add_fk(transactions_1k, CustomerID, customers) |>
  dm::dm_add_fk(transactions_1k, GasStationID, gasstations) |>
  dm::dm_add_fk(transactions_1k, ProductID, products)

The resulting database schema is shown below.

ccs_sel_dm |>
  dm::dm_draw(view_type = "all")

The starting point of the rolap package to define the transformations to obtain a star database is a flat table. To generate a flat table in a tibble from the database tables we can use the functions offered by the dm package, as shown below.

transactions_ft <- ccs_sel_dm |>
  dm::dm_flatten_to_tbl(transactions_1k, .recursive = TRUE) |>
  dm::collect()

class(transactions_ft)
#> [1] "tbl_df"     "tbl"        "data.frame"
nrow(transactions_ft)
#> [1] 1000
dput(colnames(transactions_ft))
#> c("TransactionID", "Date", "Time", "CustomerID", "CardID", "GasStationID", 
#> "ProductID", "Amount", "Price", "Segment.customers", "Currency", 
#> "ChainID", "Country", "Segment.gasstations", "Description")

With this we already have in the transactions_ft variable the flat table with the transactional data that we are going to use. In the list of fields of this table, we can see that the fields with the same name that were in several tables are now postfixed with the name of the table they came from (Segment.customers and Segment.gasstations).

Since we already have the result we need, we can disconnect from the database.

DBI::dbDisconnect(ccs_db)

Star databases

From the data in the flat table (variable transactions_ft), we are going to carry out a multidimensional design.

We will start by exploring part of the data to get to know it better.

Data exploration

The transactions_1k table has the TransactionID field as its primary key. We check that this field is a surrogate key, autonumber, as shown below.

length(unique(transactions_ft$TransactionID))
#> [1] 1000
min(transactions_ft$TransactionID)
#> [1] 1
max(transactions_ft$TransactionID)
#> [1] 1000

As would be expected, a transaction would correspond to the payment of a customer, by means of a card, for one or several products at a gas station and at a specific moment.

We can verify that this is true by obtaining the number of distinct instances for the combinations of attributes of interest.

nrow(unique(transactions_ft[, c("Date", "Time", "CustomerID")]))
#> [1] 919
nrow(unique(transactions_ft[, c("Date", "Time", "CardID")]))
#> [1] 921
nrow(unique(transactions_ft[, c("Date", "Time", "CardID", "ProductID")]))
#> [1] 1000

In view of these results, we deduce that a customer has paid at the same time with more than one card: This has been possible because the time is given in minutes. By adding the product to the combination of attributes that includes card, it is observed that all the rows of the table are obtained.

In view of this, we consider that:

There is nothing remarkable in the rest of the data.

Design with the finest possible granularity

First, we are going to make a design with the finest granularity.

We are going to add an Hour field obtained from the Time field. We also add a processing date, assuming we know that transactions are processed 48 hours after they occur.

transactions_ft <- transactions_ft |>
  dplyr::mutate(Hour = format(Time, format = "%H"))  |>
  dplyr::mutate(`Processing Date` = format(as.POSIXct(Date) + 
    lubridate::days(2), format = "%Y-%m-%d"))

Star schema

We define the star schema by distributing the flat table attributes into facts and dimensions.

library(rolap)

s_finest <- rolap::star_schema() |>
  rolap::define_facts(name = "Transaction Line",
    measures = c("Amount", "Price")) |>
  rolap::define_dimension(name = "Transaction",
    attributes = c("Date", "Time", "CardID")) |>
  rolap::define_dimension(name = "Who",
    attributes = c("CardID", "CustomerID", "Segment.customers",
                   "Currency")) |>
  rolap::define_dimension(name = "Where",
    attributes = c("GasStationID", "ChainID", "Country",
                   "Segment.gasstations")) |>
  rolap::define_dimension(name = "When",
    attributes = c("Date", "Hour")) |>
  rolap::define_dimension(name = "When Moment",
    attributes = c("Time", "Hour")) |>
  rolap::define_dimension(name = "When Processed",
    attributes = c("Processing Date", "Hour")) |>
  rolap::define_dimension(name = "What",
    attributes = c("Description"))

The most remarkable thing about the definition is that the facts correspond to a transaction line, each line corresponding to a product in the transaction. On the other hand, each transaction is identified by the attributes Date, Time and CardID. Since these concepts are already represented in other dimensions, the surrogate key (foreign key in the facts) would be enough to represent the transaction, so Transaction would be a degenerate dimension (the dimension table is not needed, the foreign key in the facts is enough).

Subsequently, the dimensions could be enriched with other attributes available in the transactional system. In this case, we are going to leave them as they are.

Star database

From the star schema and the flat table, we obtain a star database with the tables of the dimensions and facts.

db_finest <- rolap::star_database(s_finest, transactions_ft)

We can transform the database by renaming the measures of the facts and/or the attributes of the dimensions and also, if we are going to work in a RDBMS, transforming them to snake case style.

db_finest <- db_finest |>
  rolap::snake_case() |>
  rolap::set_attribute_names(
    name = "who",
    new = c(
      "card",
      "customer",
      "segment",
      "currency"
    )
  ) |>
  rolap::set_attribute_names(
    name = "where",
    new = c(
      "gas_station",
      "chain",
      "country",
      "segment"
    )
  ) |>
  rolap::set_attribute_names(
    name = "what",
    new = c(
      "product"
    )
  )

Presentation of the result

For easier access to the result tables, we are going to transform them into a list of tibble objects and show their names and number of instances.

db_tl <- db_finest |>
  rolap::as_tibble_list()

names <- names(db_tl)
for (i in seq_along(db_tl)){
  cat(sprintf("name: %s, %d rows\n", names[i], nrow(db_tl[[i]])))
}
#> name: transaction, 921 rows
#> name: who, 902 rows
#> name: where, 437 rows
#> name: when, 748 rows
#> name: when_moment, 599 rows
#> name: when_processed, 748 rows
#> name: what, 27 rows
#> name: transaction_line, 1000 rows

To represent the result graphically, we export the star database in the form of a dm object. In addition, we select the tables of the model: All except the table of the degenerate dimension.

db_dm <- db_finest |>
  rolap::as_dm_class(pk_facts = FALSE)

tables <- db_finest |>
  rolap::get_table_names()

# Degenerate dimension
tables <- base::setdiff(tables, 'transaction')

db_finest_dm <- db_dm[tables]

Below is the database schema.

db_finest_dm |> 
  dm::dm_draw(view_type = "all")

It can be seen that a new measure has been added to the fact table (nrow_agg), this measure indicates for each row the number of rows in the original table from which it was obtained. In the definition of the facts you can indicate a specific name for it. In this case, each row corresponds to a row in the original table, therefore, all the values must be 1, as can be verified below.

max(db_tl[['transaction_line']]$nrow_agg)
#> [1] 1
sum(db_tl[['transaction_line']]$nrow_agg)
#> [1] 1000

Definition of role-playing and role dimensions

In the star database, we can define that any one of the dimensions is the role-playing dimension and the other dimensions are various roles of that dimension, the only requirement is that all dimensions must have the same number of attributes. More details can be found in the vignette titled Definition of Role-Playing and Role Dimensions, vignette("v10-rpd").

For the case we are solving, the definition is shown below.

db_finest <- db_finest |>
  rolap::role_playing_dimension(
    rpd = "when",
    roles = c("when_processed")
  )

The result can be seen in the instances of the star database tables, below.

db_tl <- db_finest |>
  rolap::as_tibble_list()

names <- names(db_tl)
for (i in seq_along(db_tl)){
  cat(sprintf("name: %s, %d rows\n", names[i], nrow(db_tl[[i]])))
}
#> name: transaction, 921 rows
#> name: who, 902 rows
#> name: where, 437 rows
#> name: when, 1466 rows
#> name: when_moment, 599 rows
#> name: when_processed, 1466 rows
#> name: what, 27 rows
#> name: transaction_line, 1000 rows

We can see that the tables associated with the when dimension (with date and hour) contain the same number of instances, but a larger number than before since they include all the different instances of the two tables.

Store tables in a relational database

Once we have the star database tables in a dm object, exporting them to any RDBMS is immediate. In the example below, a local SQLite database is used. dm works with any RDBMS supported by DBI.

my_db <- DBI::dbConnect(RSQLite::SQLite())
my_db_finest_dm <- dm::copy_dm_to(my_db, db_finest_dm)
my_db_finest_dm
#> ── Table source ────────────────────────────────────────────────────────────────
#> src:  sqlite 3.41.2 []
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `transaction_line`, `what`, `when`, `when_moment`, `when_processed`, … (7 total)
#> Columns: 31
#> Primary keys: 6
#> Foreign keys: 6
DBI::dbDisconnect(my_db)

Design with a summary of the data

Next we are going to perform a multidimensional design with a coarser granularity. To have somewhat different dates from the previous example, we are going to define a new date, assuming we know that transactions are paid 24 hours after they occur.

transactions_ft <- transactions_ft |>
  dplyr::mutate(`Payment Date` = format(as.POSIXct(Date) + 
    lubridate::days(1), format = "%Y-%m-%d"))

Star schema

To design the star schema, for each dimension we exclusively select the attributes that we consider necessary.

s_summary <- rolap::star_schema() |>
  rolap::define_facts(name = "Transaction Summary",
    measures = c("Amount", "Price"),
    nrow_agg = "Transactions") |>
  rolap::define_dimension(name = "Who Segment",
    attributes = c("Segment.customers",
                   "Currency")) |>
  rolap::define_dimension(name = "Where Chain",
    attributes = c("ChainID", "Country",
                   "Segment.gasstations")) |>
  rolap::define_dimension(name = "When",
    attributes = c("Date", "Hour")) |>
  rolap::define_dimension(name = "When Paid",
    attributes = c("Payment Date", "Hour")) |>
  rolap::define_dimension(name = "What",
    attributes = c("Description"))

We have assumed that we are not interested in individual transactions but in a summary of them; neither the particular data of the clients, but only the associated segment and currency. We are not interested in gas stations either, but only the chain to which they belong and the rest of the dimension attributes. Finally, in this case, the dimension with the data of the processing date has been interpreted as payment date.

Each measure can be associated with a specific aggregation function, by default it is the sum which, in this case, can be considered adequate since adding to group the data the meaning of the measures is preserved (understanding Price as the total amount of the sale operation). Otherwise, it is advisable to define measures obtained from those available so that this principle is fulfilled.

Finally, indicate that this time a specific name has been assigned to the measure of the number of rows added: Transactions.

Star database

As before, from the star schema and the flat table, we obtain a star database with the tables of the dimensions and facts. We also transform the names of dimension attributes and define the role-playing and role dimensions.

db_summary <- rolap::star_database(s_summary, transactions_ft) |>
  rolap::snake_case() |>
  rolap::set_attribute_names(
    name = "who_segment",
    new = c(
      "segment",
      "currency"
    )
  ) |>
  rolap::set_attribute_names(
    name = "where_chain",
    new = c(
      "chain",
      "country",
      "segment"
    )
  ) |>
  rolap::set_attribute_names(
    name = "what",
    new = c(
      "product"
    )
  ) |>
  rolap::role_playing_dimension(
    rpd = "when",
    roles = c("when_paid")
  )

Finally, we present the result obtained showing the names and number of rows of the tables. We check that the number of rows added matches the total number of rows in the initial flat table.

db_tl <- db_summary |>
  rolap::as_tibble_list()

names <- names(db_tl)
for (i in seq_along(db_tl)){
  cat(sprintf("name: %s, %d rows\n", names[i], nrow(db_tl[[i]])))
}
#> name: who_segment, 6 rows
#> name: where_chain, 33 rows
#> name: when, 1366 rows
#> name: when_paid, 1366 rows
#> name: what, 27 rows
#> name: transaction_summary, 995 rows

sum(db_tl[['transaction_summary']]$transactions)
#> [1] 1000

If we compare it with the previous example, we verify that there are fewer instances in the when dimension tables because in this case the difference between dates was 24 hours (before it was 48 hours).

We also show its graphic representation.

db_summary_dm <- db_summary |>
  rolap::as_dm_class(pk_facts = FALSE)

db_summary_dm |> 
  dm::dm_draw(view_type = "all")

Tables can be stored directly in any RDBMS, just as we have done before.

Constellation

We can work with the star databases separately or in an integrated way by defining a constellation from them. On the constellation we can carry out the same operations that we perform on a star database.

ct <- rolap::constellation("CSS", db_finest, db_summary)

db_tl <- ct |>
  rolap::as_tibble_list()

names <- names(db_tl)
for (i in seq_along(db_tl)){
  cat(sprintf("name: %s, %d rows\n", names[i], nrow(db_tl[[i]])))
}
#> name: transaction, 921 rows
#> name: what, 27 rows
#> name: when, 1965 rows
#> name: when_moment, 599 rows
#> name: when_paid, 1965 rows
#> name: when_processed, 1965 rows
#> name: where, 437 rows
#> name: where_chain, 33 rows
#> name: who, 902 rows
#> name: who_segment, 6 rows
#> name: transaction_line, 1000 rows
#> name: transaction_summary, 995 rows

ct_dm_all <- ct |>
  rolap::as_dm_class(pk_facts = FALSE)

tables <- ct |>
  rolap::get_table_names()

# Degenerate dimension
tables <- base::setdiff(tables, 'transaction')

ct_dm <-
  ct_dm_all[tables]

my_db <- DBI::dbConnect(RSQLite::SQLite())
my_ct_dm <- dm::copy_dm_to(my_db, ct_dm)
my_ct_dm
#> ── Table source ────────────────────────────────────────────────────────────────
#> src:  sqlite 3.41.2 []
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `transaction_line`, `transaction_summary`, `what`, `when`, `when_moment`, … (11 total)
#> Columns: 49
#> Primary keys: 9
#> Foreign keys: 11
DBI::dbDisconnect(my_db)

ct_dm |> 
  dm::dm_draw(view_type = "all")

If we look at the when dimension tables (with date and hour), it can be seen that they all have the same number of instances: as there was a table in common between both star databases, a conformed table has been generated from them and the information about role-playing and role dimensions has been automatically updated with the new situation for the joint schema, as shown below .

ct |>
  rolap::get_role_playing_dimension_names()
#> $rpd_1
#> [1] "when"           "when_paid"      "when_processed"

Conclusions

In this document we have started from transactional data in a RDBMS and, through the dm and rolap packages, we have obtained two star databases and/or a constellation also in RDBMS. The dm package facilitates access to the RDBMS, while the rolap package supports the definition of the necessary transformations in this context.

The initial data schema is very close to a star schema although, as explained in the design section, it is conceptually different (particularly the concepts of transaction and transaction line). The simple structure of this schema has only facilitated the process of obtaining the starting flat table to work with the rolap package.


  1. In the R code that is included, for each function it is indicated by the prefix the package where it is defined.↩︎