Integration of dimension instances

Introduction

It is generally recommended to perform data cleansing as close as possible to the original data sources. Sometimes, the need to perform data cleaning operations is detected when the multidimensional system is already implemented, either because it hasn’t been detected in the data sources or because it comes from multiple data sources. For example, we can integrate several stars with common dimensions in a constellation and, when conforming the dimensions, we observe that there are instances that are actually the same and should be merged. For these cases, the functions described in this document have been developed.

The document is structured as follows: First, we show examples of functions that allow us to explore the instances of the dimensions; the following shows how to modify instances of dimensions in a star database for several attributes or for a single attribute; the next section explains how the integration of instance modification operations is done when we define a constellation; finally, we end with conclusions.

Exploration of dimensions

To show by example the available operations, we are going to use the multidimensional designs developed in the vignette titled Star databases and RDBMS through the dm and rolap packages, vignette("v20-rdbms-dm"), which are available as a star_database objects (db_summary and db_finest).

Consult tables and dimensions

We will start working on one of the star databases, db_summary. As shown below, we can get the names of the available tables and, exporting them as a list of tibble objects, we can see their number of instances.

library(rolap)

class(db_summary)
#> [1] "star_database"

names <- db_summary |>
  get_table_names()

db_tl <- db_summary |>
  as_tibble_list()

for (n in names) {
  cat(sprintf("name: %s, %d rows\n", n, nrow(db_tl[[n]])))
}
#> name: transaction_summary, 995 rows
#> name: what, 27 rows
#> name: when, 1366 rows
#> name: when_paid, 1366 rows
#> name: where_chain, 33 rows
#> name: who_segment, 6 rows

If instead of all the tables, we want to focus on the dimensions, we can get their names using the following function.

db_summary |>
  get_dimension_names()
#> [1] "what"        "when"        "when_paid"   "where_chain" "who_segment"

Although the design includes all of the above dimensions, we have several dimensions that are role-playing and role dimensions, as shown below.

db_summary |>
  get_role_playing_dimension_names()
#> $rpd_1
#> [1] "when"      "when_paid"

This means that these dimensions are considered to be represented by a single table.

Detect similar attribute values in dimensions

The goal is to find dimension instances whose attribute values differ in possible different spellings (number of white space, case, use of punctuation). It does not detect abbreviations.

An initial exploration can be done considering individual attributes of the dimensions that are indicated or of all the dimensions, if none are indicated.

db_summary |>
  get_similar_attribute_values_individually()
#> $who_segment
#> list()
#> 
#> $where_chain
#> list()
#> 
#> $what
#> $what[[1]]
#> # A tibble: 2 × 1
#>   product 
#>   <chr>   
#> 1 Diesel  
#> 2 Diesel +
#> 
#> $what[[2]]
#> # A tibble: 2 × 1
#>   product  
#>   <chr>    
#> 1 Natural  
#> 2 Natural +
#> 
#> 
#> $when
#> list()

Since no dimension has been indicated, all the dimensions have been considered but the role-playing and role dimensions have been taken into account to show the results of only one of them.

In a complementary way, we could consider combinations of attributes within each dimension. If we do not indicate any dimension, the combination of all the attributes of each of the dimensions is considered, as shown below.

db_summary |>
  get_similar_attribute_values()
#> $who_segment
#> list()
#> 
#> $where_chain
#> list()
#> 
#> $what
#> $what[[1]]
#> # A tibble: 2 × 1
#>   product 
#>   <chr>   
#> 1 Diesel  
#> 2 Diesel +
#> 
#> $what[[2]]
#> # A tibble: 2 × 1
#>   product  
#>   <chr>    
#> 1 Natural  
#> 2 Natural +
#> 
#> 
#> $when
#> list()

To more clearly show how these functions work, we are going to modify a value in a test star database and perform the query on a single dimension.

db_test <- db_summary
(value_seg1 <- db_test$dimensions$where_chain$table$chain[3])
#> [1] "13"
(value_seg2 <- db_test$dimensions$where_chain$table$chain[4])
#> [1] "130"
db_test$dimensions$where_chain$table$chain[3] <- ' ,1-.% $1)='
db_test$dimensions$where_chain$table$chain[4] <- ' ,1-.% $1)='

db_test |>
  get_similar_attribute_values_individually('where_chain')
#> [[1]]
#> # A tibble: 2 × 1
#>   chain        
#>   <chr>        
#> 1 " ,1-.% $1)="
#> 2 "11"

db_test |>
  get_similar_attribute_values('where_chain')
#> [[1]]
#> # A tibble: 2 × 3
#>   chain         country segment        
#>   <chr>         <chr>   <chr>          
#> 1 " ,1-.% $1)=" CZE     Value for money
#> 2 "11"          CZE     Value for money

Get unique values of attributes in dimensions

If we focus on a dimension, we can query its attributes and get the unique value formed by combinations of them. We can indicate a set of attributes and, if we do not indicate any, all dimension attributes are considered.

db_test |> 
  get_attribute_names('where_chain')
#> [1] "chain"   "country" "segment"

db_test |>
  get_unique_attribute_values('where_chain', attributes = c('country', 'segment'))
#> # A tibble: 10 × 2
#>    country segment        
#>    <chr>   <chr>          
#>  1 CZE     Discount       
#>  2 CZE     Noname         
#>  3 CZE     Other          
#>  4 CZE     Premium        
#>  5 CZE     Value for money
#>  6 SVK     Discount       
#>  7 SVK     Noname         
#>  8 SVK     Other          
#>  9 SVK     Premium        
#> 10 SVK     Value for money

db_test |>
  get_unique_attribute_values('where_chain', attributes = 'chain')
#> # A tibble: 32 × 1
#>    chain        
#>    <chr>        
#>  1 " ,1-.% $1)="
#>  2 "1"          
#>  3 "11"         
#>  4 "14"         
#>  5 "15"         
#>  6 "19"         
#>  7 "2"          
#>  8 "21"         
#>  9 "219"        
#> 10 "22"         
#> # ℹ 22 more rows

db_test |>
  get_unique_attribute_values('where_chain')
#> # A tibble: 33 × 3
#>    chain         country segment        
#>    <chr>         <chr>   <chr>          
#>  1 " ,1-.% $1)=" CZE     Value for money
#>  2 " ,1-.% $1)=" SVK     Premium        
#>  3 "1"           CZE     Premium        
#>  4 "11"          CZE     Value for money
#>  5 "14"          CZE     Noname         
#>  6 "15"          CZE     Premium        
#>  7 "19"          CZE     Discount       
#>  8 "2"           CZE     Noname         
#>  9 "21"          CZE     Discount       
#> 10 "219"         SVK     Other          
#> # ℹ 23 more rows

Modifying attribute instances in dimensions

In this section we are going to show several cases of modification of attribute values. Surely, the most suitable thing would be to carry out the necessary modification operations in the data sources (or as close to them as possible). In any case, we are going to do them here to show the possibilities of the rolap package.

Modification of a single value indicating several attributes

First, let’s start by fixing the bug introduced in the test star database. For all of the above functions, we can define the parameter col_as_vector that indicates to display the values as a vector. We will see later that this can be useful to make changes to the values, since it is the way to indicate them.

db_test |>
  get_unique_attribute_values(
    'where_chain',
    attributes = c('chain', 'country'),
    col_as_vector = "As a vector"
  )
#> # A tibble: 33 × 3
#>    chain         country `As a vector`          
#>    <chr>         <chr>   <chr>                  
#>  1 " ,1-.% $1)=" CZE     c(' ,1-.% $1)=', 'CZE')
#>  2 " ,1-.% $1)=" SVK     c(' ,1-.% $1)=', 'SVK')
#>  3 "1"           CZE     c('1', 'CZE')          
#>  4 "11"          CZE     c('11', 'CZE')         
#>  5 "14"          CZE     c('14', 'CZE')         
#>  6 "15"          CZE     c('15', 'CZE')         
#>  7 "19"          CZE     c('19', 'CZE')         
#>  8 "2"           CZE     c('2', 'CZE')          
#>  9 "21"          CZE     c('21', 'CZE')         
#> 10 "219"         SVK     c('219', 'SVK')        
#> # ℹ 23 more rows

Using the following function, we can indicate one or more attributes and the old and new values for them. If we indicate several attributes, in each call we can only change a combination of values. In this way we restore the original values that we had saved, as you can check below.

db_summary2 <- db_test |>
  replace_attribute_values(
    name = 'where_chain',
    attributes = c('chain', 'country'),
    old = c(' ,1-.% $1)=', 'CZE'),
    new = c(value_seg1, 'CZE')
  ) |>
  replace_attribute_values(
    name = 'where_chain',
    attributes = c('chain', 'country'),
    old = c(' ,1-.% $1)=', 'SVK'),
    new = c(value_seg2, 'SVK')
  )

db_summary2 |>
  get_unique_attribute_values(
    'where_chain',
    attributes = c('chain', 'country'),
    col_as_vector = "As a vector"
  )
#> # A tibble: 33 × 3
#>    chain country `As a vector`  
#>    <chr> <chr>   <chr>          
#>  1 1     CZE     c('1', 'CZE')  
#>  2 11    CZE     c('11', 'CZE') 
#>  3 13    CZE     c('13', 'CZE') 
#>  4 130   SVK     c('130', 'SVK')
#>  5 14    CZE     c('14', 'CZE') 
#>  6 15    CZE     c('15', 'CZE') 
#>  7 19    CZE     c('19', 'CZE') 
#>  8 2     CZE     c('2', 'CZE')  
#>  9 21    CZE     c('21', 'CZE') 
#> 10 219   SVK     c('219', 'SVK')
#> # ℹ 23 more rows

Modifying multiple values of an attribute

If we want to modify the values of a single attribute, we can indicate several values that correspond to a new value.

In this case, we are going to make modifications in two of the dimensions. First, in dimension what.

t <- db_summary2 |>
  get_unique_attribute_values('what', col_as_vector = "As a vector")

Suppose that, in view of the available values, we consider that we do not need details of all those products, but that we need a simplification: The one shown below.

db_summary3 <- db_summary2 |>
  replace_attribute_values(
    name = 'what',
    attributes = 'product',
    old = c('Autokosmet.'),
    new = c('Autokozmetik')
  ) |>
  replace_attribute_values(
    name = 'what',
    attributes = 'product',
    old = c('Diesel aditiv'),
    new = c('Diesel +')
  ) |>
  replace_attribute_values(
    name = 'what',
    attributes = 'product',
    old = c('Nafta Plus'),
    new = c('Nafta +')
  ) |>
  replace_attribute_values(
    name = 'what',
    attributes = 'product',
    old = c('Nat.Super', 'Natural Plus', 'Natural Spec'),
    new = c('Natural +')
  ) |>
  replace_attribute_values(
    name = 'what',
    attributes = 'product',
    old = c('Autoprísluš.', 'Dalnic.popl.', 'Knihy,nov.', 'LPG', 'Mytí vozidel',
            'Nemrz.kapal.', 'Obcerstvení', 'Oleje,tuky', 'Potraviny', 'Prev.náplne',
            'Provoz.nápl.', 'Umývanie voz', 'Zboží nesp.', 'Zpr.nakupu'),
    new = c('Other')
  ) |>
  group_dimension_instances(name = 'what')
#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor

#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor

#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor

We change the name of some products because there seems to be an error or they were spelled two different ways; in others cases we group several products into a new one, the largest grouping is for products that are renamed Other. Finally, if, as a result of the modifications, we have reduced the number of distinct instances of the dimension, when we finish making modifications, we must group the dimension instances to simplify it using the group_dimension_instances() function. As a side result, the facts may also be grouped.

The result for the dimension can be seen below.

db_summary3 |>
  get_unique_attribute_values('what')
#> # A tibble: 9 × 1
#>   product     
#>   <chr>       
#> 1 Additivum   
#> 2 Autokozmetik
#> 3 Diesel      
#> 4 Diesel +    
#> 5 Nafta       
#> 6 Nafta +     
#> 7 Natural     
#> 8 Natural +   
#> 9 Other

If we get the number of instances of the tables (as we did at the beginning of this document) we verify that the number of records in the fact table has been reduced. We also check that the number of original records (indicated by measure transactions) from which these have been obtained is maintained. That is, several fact table records have been added as a side effect of the dimension changes.

db_tl <- db_summary3 |>
  as_tibble_list()

for (n in names) {
  cat(sprintf("name: %s, %d rows\n", n, nrow(db_tl[[n]])))
}
#> name: transaction_summary, 984 rows
#> name: what, 9 rows
#> name: when, 1366 rows
#> name: when_paid, 1366 rows
#> name: where_chain, 33 rows
#> name: who_segment, 6 rows

sum(db_tl[['transaction_summary']]$transactions)
#> [1] 1000

Modification of instances in role-playing and role dimensions

The two When dimensions are role-playing and role dimensions. That is, they have the same instances and one can be obtained from the other by renaming attributes if necessary and of course the table.

db_summary3 |>
  get_role_playing_dimension_names()
#> $rpd_1
#> [1] "when"      "when_paid"

db_summary3 |>
  get_unique_attribute_values('when')
#> # A tibble: 1,366 × 2
#>    date       hour        
#>    <chr>      <chr>       
#>  1 2012-08-23 "34800 secs"
#>  2 2012-08-23 "35400 secs"
#>  3 2012-08-23 "35460 secs"
#>  4 2012-08-23 "72720 secs"
#>  5 2012-08-23 "73140 secs"
#>  6 2012-08-23 "76800 secs"
#>  7 2012-08-23 "81720 secs"
#>  8 2012-08-23 "82740 secs"
#>  9 2012-08-23 "82800 secs"
#> 10 2012-08-24 " 2040 secs"
#> # ℹ 1,356 more rows

db_summary3 |>
  get_unique_attribute_values('when_paid')
#> # A tibble: 1,366 × 2
#>    date       hour        
#>    <chr>      <chr>       
#>  1 2012-08-23 "34800 secs"
#>  2 2012-08-23 "35400 secs"
#>  3 2012-08-23 "35460 secs"
#>  4 2012-08-23 "72720 secs"
#>  5 2012-08-23 "73140 secs"
#>  6 2012-08-23 "76800 secs"
#>  7 2012-08-23 "81720 secs"
#>  8 2012-08-23 "82740 secs"
#>  9 2012-08-23 "82800 secs"
#> 10 2012-08-24 " 2040 secs"
#> # ℹ 1,356 more rows

Let us suppose that we are not interested in the specific hours of the operations, but rather the part of the day in which they were carried out. Therefore, we define the modifications on one of the dimensions When.

db_summary4 <- db_summary3 |>
  replace_attribute_values(
    name = 'when',
    attributes = 'hour',
    old = c('05', '06', '07', '08', '09', '10', '11'),
    new = c('Morning')
  ) |>
  replace_attribute_values(
    name = 'when',
    attributes = 'hour',
    old = c('12', '13', '14', '15', '16'),
    new = c('Afternoon')
  ) |>
  replace_attribute_values(
    name = 'when',
    attributes = 'hour',
    old = c('17', '18', '19', '20'),
    new = c('Evening')
  ) |>
  replace_attribute_values(
    name = 'when',
    attributes = 'hour',
    old = c('21', '22', '23', '00', '01', '02', '03', '04'),
    new = c('Night')
  ) |>
  group_dimension_instances(name = 'when')
#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor

#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor

#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor

The result obtained in both dimensions is shown below. That is, the modifications have been propagated to the equivalent dimension.

db_summary4 |>
  get_unique_attribute_values('when')
#> # A tibble: 1,366 × 2
#>    date       hour        
#>    <chr>      <chr>       
#>  1 2012-08-23 "34800 secs"
#>  2 2012-08-23 "35400 secs"
#>  3 2012-08-23 "35460 secs"
#>  4 2012-08-23 "72720 secs"
#>  5 2012-08-23 "73140 secs"
#>  6 2012-08-23 "76800 secs"
#>  7 2012-08-23 "81720 secs"
#>  8 2012-08-23 "82740 secs"
#>  9 2012-08-23 "82800 secs"
#> 10 2012-08-24 " 2040 secs"
#> # ℹ 1,356 more rows

db_summary4 |>
  get_unique_attribute_values('when_paid')
#> # A tibble: 1,366 × 2
#>    date       hour        
#>    <chr>      <chr>       
#>  1 2012-08-23 "34800 secs"
#>  2 2012-08-23 "35400 secs"
#>  3 2012-08-23 "35460 secs"
#>  4 2012-08-23 "72720 secs"
#>  5 2012-08-23 "73140 secs"
#>  6 2012-08-23 "76800 secs"
#>  7 2012-08-23 "81720 secs"
#>  8 2012-08-23 "82740 secs"
#>  9 2012-08-23 "82800 secs"
#> 10 2012-08-24 " 2040 secs"
#> # ℹ 1,356 more rows

We do the same check as before on the number of instances of the tables and we observe that in this case the number of records in the fact table has also been reduced as a result of the aggregation carried out.

db_tl <- db_summary4 |>
  as_tibble_list()

for (n in names) {
  cat(sprintf("name: %s, %d rows\n", n, nrow(db_tl[[n]])))
}
#> name: transaction_summary, 984 rows
#> name: what, 9 rows
#> name: when, 1366 rows
#> name: when_paid, 1366 rows
#> name: where_chain, 33 rows
#> name: who_segment, 6 rows

sum(db_tl[['transaction_summary']]$transactions)
#> [1] 1000

Modifications of instances in a constellation

To define a constellation we need at least two star databases. We consider the other star database from the starting example. We look at the instances of the what dimension.

db_finest |>
  get_unique_attribute_values('what', col_as_vector = "As a vector")
#> # A tibble: 27 × 2
#>    product       `As a vector`     
#>    <chr>         <chr>             
#>  1 Additivum     c('Additivum')    
#>  2 Autokosmet.   c('Autokosmet.')  
#>  3 Autokozmetik  c('Autokozmetik') 
#>  4 Autoprísluš.  c('Autoprísluš.') 
#>  5 Dalnic.popl.  c('Dalnic.popl.') 
#>  6 Diesel        c('Diesel')       
#>  7 Diesel +      c('Diesel +')     
#>  8 Diesel aditiv c('Diesel aditiv')
#>  9 Knihy,nov.    c('Knihy,nov.')   
#> 10 LPG           c('LPG')          
#> # ℹ 17 more rows

Suppose some instances of this dimension are also modified, as shown below.

db_finest2 <- db_finest |>
  replace_attribute_values(
    name = 'what',
    attributes = 'product',
    old = c('Additivum', 'Autokozmetik'),
    new = c('Other')
  ) |>
  group_dimension_instances(name = 'what')

db_finest2 |>
  get_unique_attribute_values('what', col_as_vector = "As a vector")
#> # A tibble: 26 × 2
#>    product       `As a vector`     
#>    <chr>         <chr>             
#>  1 Autokosmet.   c('Autokosmet.')  
#>  2 Autoprísluš.  c('Autoprísluš.') 
#>  3 Dalnic.popl.  c('Dalnic.popl.') 
#>  4 Diesel        c('Diesel')       
#>  5 Diesel +      c('Diesel +')     
#>  6 Diesel aditiv c('Diesel aditiv')
#>  7 Knihy,nov.    c('Knihy,nov.')   
#>  8 LPG           c('LPG')          
#>  9 Mytí vozidel  c('Mytí vozidel') 
#> 10 Nafta         c('Nafta')        
#> # ℹ 16 more rows

We define the constellation and as a result the dimensions are integrated.

ct <- constellation("CSS", db_summary4, db_finest2)

class(ct)
#> [1] "star_database"

names <- ct |>
  get_table_names()

db_tl <- ct |>
  as_tibble_list()

for (n in names) {
  cat(sprintf("name: %s, %d rows\n", n, nrow(db_tl[[n]])))
}
#> name: transaction, 921 rows
#> name: transaction_line, 1000 rows
#> name: transaction_summary, 984 rows
#> name: what, 29 rows
#> name: when, 1965 rows
#> name: when_moment, 599 rows
#> name: when_paid, 1965 rows
#> name: when_processed, 1965 rows
#> name: where, 437 rows
#> name: where_chain, 33 rows
#> name: who, 902 rows
#> name: who_segment, 6 rows

Transformations in just one star database

For the dimension When, transformations were defined in only one of the star databases. The role-playing and role dimensions of the two star databases are also integrated. As a result, modifications are carried over from one star database to the other.

ct |>
  get_role_playing_dimension_names()
#> $rpd_1
#> [1] "when"           "when_paid"      "when_processed"

ct |>
  get_unique_attribute_values('when_processed')
#> # A tibble: 1,965 × 2
#>    date       hour        
#>    <chr>      <chr>       
#>  1 2012-08-23 "34800 secs"
#>  2 2012-08-23 "35400 secs"
#>  3 2012-08-23 "35460 secs"
#>  4 2012-08-23 "72720 secs"
#>  5 2012-08-23 "73140 secs"
#>  6 2012-08-23 "76800 secs"
#>  7 2012-08-23 "81720 secs"
#>  8 2012-08-23 "82740 secs"
#>  9 2012-08-23 "82800 secs"
#> 10 2012-08-24 " 2040 secs"
#> # ℹ 1,955 more rows

The result is shown for dimension when_processed, which is not part of the star database on which the modifications were defined but it is modified.

Transformations in both star databases

In the case of the what dimension, transformations have been carried out in both star databases: They are combined taking the new operations from one database to another.

ct |>
  get_unique_attribute_values('what', col_as_vector = "As a vector")
#> # A tibble: 29 × 2
#>    product       `As a vector`     
#>    <chr>         <chr>             
#>  1 Additivum     c('Additivum')    
#>  2 Autokosmet.   c('Autokosmet.')  
#>  3 Autokozmetik  c('Autokozmetik') 
#>  4 Autoprísluš.  c('Autoprísluš.') 
#>  5 Dalnic.popl.  c('Dalnic.popl.') 
#>  6 Diesel        c('Diesel')       
#>  7 Diesel +      c('Diesel +')     
#>  8 Diesel aditiv c('Diesel aditiv')
#>  9 Knihy,nov.    c('Knihy,nov.')   
#> 10 LPG           c('LPG')          
#> # ℹ 19 more rows

The result obtained is explained because in star database db_summary the new operations of star database db_finest have been applied and vice versa. In the star database db_finest, the Autokosmet. to Autokozmetik transformation has been applied after the Autokozmetik to Other transformation, so the Autokozmetik value has been generated.

This situation has been sought in this example. When defining a constellation we must review the resulting dimensions. In any case, we can continue to make changes to the constellation, as shown below.

ct2 <- ct |>
  replace_attribute_values(
    name = 'what',
    attributes = 'product',
    old = c('Autokozmetik'),
    new = c('Other')
  ) |>
  group_dimension_instances(name = 'what')
#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor

ct2 |>
  get_unique_attribute_values('what', col_as_vector = "As a vector")
#> # A tibble: 28 × 2
#>    product       `As a vector`     
#>    <chr>         <chr>             
#>  1 Additivum     c('Additivum')    
#>  2 Autokosmet.   c('Autokosmet.')  
#>  3 Autoprísluš.  c('Autoprísluš.') 
#>  4 Dalnic.popl.  c('Dalnic.popl.') 
#>  5 Diesel        c('Diesel')       
#>  6 Diesel +      c('Diesel +')     
#>  7 Diesel aditiv c('Diesel aditiv')
#>  8 Knihy,nov.    c('Knihy,nov.')   
#>  9 LPG           c('LPG')          
#> 10 Mytí vozidel  c('Mytí vozidel') 
#> # ℹ 18 more rows

Conclusions

This document shows how to perform modification operations on dimension instances in both star databases and constellations. The modifications take into account the definition of role-playing and role dimensions both in a single database and in the integration of several to form a constellation.

In the case of performing operations on common dimensions of several star databases, the new operations are carried from one to another star database. If there is a conflict, we may need to define new operations on the constellation.

Although it is usually advisable to carry out the modification operations as close as possible to the data sources, this mechanism gives us flexibility especially when integrating several star databases into a constellation.