It is generally recommended to perform data cleansing as close as possible to the original data sources. Sometimes, the need to perform data cleaning operations is detected when the multidimensional system is already implemented, either because it hasn’t been detected in the data sources or because it comes from multiple data sources. For example, we can integrate several stars with common dimensions in a constellation and, when conforming the dimensions, we observe that there are instances that are actually the same and should be merged. For these cases, the functions described in this document have been developed.
The document is structured as follows: First, we show examples of functions that allow us to explore the instances of the dimensions; the following shows how to modify instances of dimensions in a star database for several attributes or for a single attribute; the next section explains how the integration of instance modification operations is done when we define a constellation; finally, we end with conclusions.
To show by example the available operations, we are going to use the
multidimensional designs developed in the vignette titled Star
databases and RDBMS through the dm
and rolap
packages, vignette("v20-rdbms-dm")
, which are
available as a star_database
objects
(db_summary
and db_finest
).
We will start working on one of the star databases,
db_summary
. As shown below, we can get the names of the
available tables and, exporting them as a list of tibble
objects, we can see their number of instances.
library(rolap)
class(db_summary)
#> [1] "star_database"
names <- db_summary |>
get_table_names()
db_tl <- db_summary |>
as_tibble_list()
for (n in names) {
cat(sprintf("name: %s, %d rows\n", n, nrow(db_tl[[n]])))
}
#> name: transaction_summary, 995 rows
#> name: what, 27 rows
#> name: when, 1366 rows
#> name: when_paid, 1366 rows
#> name: where_chain, 33 rows
#> name: who_segment, 6 rows
If instead of all the tables, we want to focus on the dimensions, we can get their names using the following function.
Although the design includes all of the above dimensions, we have several dimensions that are role-playing and role dimensions, as shown below.
This means that these dimensions are considered to be represented by a single table.
The goal is to find dimension instances whose attribute values differ in possible different spellings (number of white space, case, use of punctuation). It does not detect abbreviations.
An initial exploration can be done considering individual attributes of the dimensions that are indicated or of all the dimensions, if none are indicated.
db_summary |>
get_similar_attribute_values_individually()
#> $who_segment
#> list()
#>
#> $where_chain
#> list()
#>
#> $what
#> $what[[1]]
#> # A tibble: 2 × 1
#> product
#> <chr>
#> 1 Diesel
#> 2 Diesel +
#>
#> $what[[2]]
#> # A tibble: 2 × 1
#> product
#> <chr>
#> 1 Natural
#> 2 Natural +
#>
#>
#> $when
#> list()
Since no dimension has been indicated, all the dimensions have been considered but the role-playing and role dimensions have been taken into account to show the results of only one of them.
In a complementary way, we could consider combinations of attributes within each dimension. If we do not indicate any dimension, the combination of all the attributes of each of the dimensions is considered, as shown below.
db_summary |>
get_similar_attribute_values()
#> $who_segment
#> list()
#>
#> $where_chain
#> list()
#>
#> $what
#> $what[[1]]
#> # A tibble: 2 × 1
#> product
#> <chr>
#> 1 Diesel
#> 2 Diesel +
#>
#> $what[[2]]
#> # A tibble: 2 × 1
#> product
#> <chr>
#> 1 Natural
#> 2 Natural +
#>
#>
#> $when
#> list()
To more clearly show how these functions work, we are going to modify a value in a test star database and perform the query on a single dimension.
db_test <- db_summary
(value_seg1 <- db_test$dimensions$where_chain$table$chain[3])
#> [1] "13"
(value_seg2 <- db_test$dimensions$where_chain$table$chain[4])
#> [1] "130"
db_test$dimensions$where_chain$table$chain[3] <- ' ,1-.% $1)='
db_test$dimensions$where_chain$table$chain[4] <- ' ,1-.% $1)='
db_test |>
get_similar_attribute_values_individually('where_chain')
#> [[1]]
#> # A tibble: 2 × 1
#> chain
#> <chr>
#> 1 " ,1-.% $1)="
#> 2 "11"
db_test |>
get_similar_attribute_values('where_chain')
#> [[1]]
#> # A tibble: 2 × 3
#> chain country segment
#> <chr> <chr> <chr>
#> 1 " ,1-.% $1)=" CZE Value for money
#> 2 "11" CZE Value for money
If we focus on a dimension, we can query its attributes and get the unique value formed by combinations of them. We can indicate a set of attributes and, if we do not indicate any, all dimension attributes are considered.
db_test |>
get_attribute_names('where_chain')
#> [1] "chain" "country" "segment"
db_test |>
get_unique_attribute_values('where_chain', attributes = c('country', 'segment'))
#> # A tibble: 10 × 2
#> country segment
#> <chr> <chr>
#> 1 CZE Discount
#> 2 CZE Noname
#> 3 CZE Other
#> 4 CZE Premium
#> 5 CZE Value for money
#> 6 SVK Discount
#> 7 SVK Noname
#> 8 SVK Other
#> 9 SVK Premium
#> 10 SVK Value for money
db_test |>
get_unique_attribute_values('where_chain', attributes = 'chain')
#> # A tibble: 32 × 1
#> chain
#> <chr>
#> 1 " ,1-.% $1)="
#> 2 "1"
#> 3 "11"
#> 4 "14"
#> 5 "15"
#> 6 "19"
#> 7 "2"
#> 8 "21"
#> 9 "219"
#> 10 "22"
#> # ℹ 22 more rows
db_test |>
get_unique_attribute_values('where_chain')
#> # A tibble: 33 × 3
#> chain country segment
#> <chr> <chr> <chr>
#> 1 " ,1-.% $1)=" CZE Value for money
#> 2 " ,1-.% $1)=" SVK Premium
#> 3 "1" CZE Premium
#> 4 "11" CZE Value for money
#> 5 "14" CZE Noname
#> 6 "15" CZE Premium
#> 7 "19" CZE Discount
#> 8 "2" CZE Noname
#> 9 "21" CZE Discount
#> 10 "219" SVK Other
#> # ℹ 23 more rows
In this section we are going to show several cases of modification of
attribute values. Surely, the most suitable thing would be to carry out
the necessary modification operations in the data sources (or as close
to them as possible). In any case, we are going to do them here to show
the possibilities of the rolap
package.
First, let’s start by fixing the bug introduced in the test star
database. For all of the above functions, we can define the parameter
col_as_vector
that indicates to display the values as a
vector. We will see later that this can be useful to make changes to the
values, since it is the way to indicate them.
db_test |>
get_unique_attribute_values(
'where_chain',
attributes = c('chain', 'country'),
col_as_vector = "As a vector"
)
#> # A tibble: 33 × 3
#> chain country `As a vector`
#> <chr> <chr> <chr>
#> 1 " ,1-.% $1)=" CZE c(' ,1-.% $1)=', 'CZE')
#> 2 " ,1-.% $1)=" SVK c(' ,1-.% $1)=', 'SVK')
#> 3 "1" CZE c('1', 'CZE')
#> 4 "11" CZE c('11', 'CZE')
#> 5 "14" CZE c('14', 'CZE')
#> 6 "15" CZE c('15', 'CZE')
#> 7 "19" CZE c('19', 'CZE')
#> 8 "2" CZE c('2', 'CZE')
#> 9 "21" CZE c('21', 'CZE')
#> 10 "219" SVK c('219', 'SVK')
#> # ℹ 23 more rows
Using the following function, we can indicate one or more attributes and the old and new values for them. If we indicate several attributes, in each call we can only change a combination of values. In this way we restore the original values that we had saved, as you can check below.
db_summary2 <- db_test |>
replace_attribute_values(
name = 'where_chain',
attributes = c('chain', 'country'),
old = c(' ,1-.% $1)=', 'CZE'),
new = c(value_seg1, 'CZE')
) |>
replace_attribute_values(
name = 'where_chain',
attributes = c('chain', 'country'),
old = c(' ,1-.% $1)=', 'SVK'),
new = c(value_seg2, 'SVK')
)
db_summary2 |>
get_unique_attribute_values(
'where_chain',
attributes = c('chain', 'country'),
col_as_vector = "As a vector"
)
#> # A tibble: 33 × 3
#> chain country `As a vector`
#> <chr> <chr> <chr>
#> 1 1 CZE c('1', 'CZE')
#> 2 11 CZE c('11', 'CZE')
#> 3 13 CZE c('13', 'CZE')
#> 4 130 SVK c('130', 'SVK')
#> 5 14 CZE c('14', 'CZE')
#> 6 15 CZE c('15', 'CZE')
#> 7 19 CZE c('19', 'CZE')
#> 8 2 CZE c('2', 'CZE')
#> 9 21 CZE c('21', 'CZE')
#> 10 219 SVK c('219', 'SVK')
#> # ℹ 23 more rows
If we want to modify the values of a single attribute, we can indicate several values that correspond to a new value.
In this case, we are going to make modifications in two of the dimensions. First, in dimension what.
Suppose that, in view of the available values, we consider that we do not need details of all those products, but that we need a simplification: The one shown below.
db_summary3 <- db_summary2 |>
replace_attribute_values(
name = 'what',
attributes = 'product',
old = c('Autokosmet.'),
new = c('Autokozmetik')
) |>
replace_attribute_values(
name = 'what',
attributes = 'product',
old = c('Diesel aditiv'),
new = c('Diesel +')
) |>
replace_attribute_values(
name = 'what',
attributes = 'product',
old = c('Nafta Plus'),
new = c('Nafta +')
) |>
replace_attribute_values(
name = 'what',
attributes = 'product',
old = c('Nat.Super', 'Natural Plus', 'Natural Spec'),
new = c('Natural +')
) |>
replace_attribute_values(
name = 'what',
attributes = 'product',
old = c('Autoprísluš.', 'Dalnic.popl.', 'Knihy,nov.', 'LPG', 'Mytí vozidel',
'Nemrz.kapal.', 'Obcerstvení', 'Oleje,tuky', 'Potraviny', 'Prev.náplne',
'Provoz.nápl.', 'Umývanie voz', 'Zboží nesp.', 'Zpr.nakupu'),
new = c('Other')
) |>
group_dimension_instances(name = 'what')
#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor
#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor
#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor
We change the name of some products because there seems to be an
error or they were spelled two different ways; in others cases we group
several products into a new one, the largest grouping is for products
that are renamed Other. Finally, if, as a result of the
modifications, we have reduced the number of distinct instances of the
dimension, when we finish making modifications, we must group the
dimension instances to simplify it using the
group_dimension_instances()
function. As a side result, the
facts may also be grouped.
The result for the dimension can be seen below.
db_summary3 |>
get_unique_attribute_values('what')
#> # A tibble: 9 × 1
#> product
#> <chr>
#> 1 Additivum
#> 2 Autokozmetik
#> 3 Diesel
#> 4 Diesel +
#> 5 Nafta
#> 6 Nafta +
#> 7 Natural
#> 8 Natural +
#> 9 Other
If we get the number of instances of the tables (as we did at the
beginning of this document) we verify that the number of records in the
fact table has been reduced. We also check that the number of original
records (indicated by measure transactions
) from which
these have been obtained is maintained. That is, several fact table
records have been added as a side effect of the dimension changes.
db_tl <- db_summary3 |>
as_tibble_list()
for (n in names) {
cat(sprintf("name: %s, %d rows\n", n, nrow(db_tl[[n]])))
}
#> name: transaction_summary, 984 rows
#> name: what, 9 rows
#> name: when, 1366 rows
#> name: when_paid, 1366 rows
#> name: where_chain, 33 rows
#> name: who_segment, 6 rows
sum(db_tl[['transaction_summary']]$transactions)
#> [1] 1000
The two When dimensions are role-playing and role dimensions. That is, they have the same instances and one can be obtained from the other by renaming attributes if necessary and of course the table.
db_summary3 |>
get_role_playing_dimension_names()
#> $rpd_1
#> [1] "when" "when_paid"
db_summary3 |>
get_unique_attribute_values('when')
#> # A tibble: 1,366 × 2
#> date hour
#> <chr> <chr>
#> 1 2012-08-23 "34800 secs"
#> 2 2012-08-23 "35400 secs"
#> 3 2012-08-23 "35460 secs"
#> 4 2012-08-23 "72720 secs"
#> 5 2012-08-23 "73140 secs"
#> 6 2012-08-23 "76800 secs"
#> 7 2012-08-23 "81720 secs"
#> 8 2012-08-23 "82740 secs"
#> 9 2012-08-23 "82800 secs"
#> 10 2012-08-24 " 2040 secs"
#> # ℹ 1,356 more rows
db_summary3 |>
get_unique_attribute_values('when_paid')
#> # A tibble: 1,366 × 2
#> date hour
#> <chr> <chr>
#> 1 2012-08-23 "34800 secs"
#> 2 2012-08-23 "35400 secs"
#> 3 2012-08-23 "35460 secs"
#> 4 2012-08-23 "72720 secs"
#> 5 2012-08-23 "73140 secs"
#> 6 2012-08-23 "76800 secs"
#> 7 2012-08-23 "81720 secs"
#> 8 2012-08-23 "82740 secs"
#> 9 2012-08-23 "82800 secs"
#> 10 2012-08-24 " 2040 secs"
#> # ℹ 1,356 more rows
Let us suppose that we are not interested in the specific hours of the operations, but rather the part of the day in which they were carried out. Therefore, we define the modifications on one of the dimensions When.
db_summary4 <- db_summary3 |>
replace_attribute_values(
name = 'when',
attributes = 'hour',
old = c('05', '06', '07', '08', '09', '10', '11'),
new = c('Morning')
) |>
replace_attribute_values(
name = 'when',
attributes = 'hour',
old = c('12', '13', '14', '15', '16'),
new = c('Afternoon')
) |>
replace_attribute_values(
name = 'when',
attributes = 'hour',
old = c('17', '18', '19', '20'),
new = c('Evening')
) |>
replace_attribute_values(
name = 'when',
attributes = 'hour',
old = c('21', '22', '23', '00', '01', '02', '03', '04'),
new = c('Night')
) |>
group_dimension_instances(name = 'when')
#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor
#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor
#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor
The result obtained in both dimensions is shown below. That is, the modifications have been propagated to the equivalent dimension.
db_summary4 |>
get_unique_attribute_values('when')
#> # A tibble: 1,366 × 2
#> date hour
#> <chr> <chr>
#> 1 2012-08-23 "34800 secs"
#> 2 2012-08-23 "35400 secs"
#> 3 2012-08-23 "35460 secs"
#> 4 2012-08-23 "72720 secs"
#> 5 2012-08-23 "73140 secs"
#> 6 2012-08-23 "76800 secs"
#> 7 2012-08-23 "81720 secs"
#> 8 2012-08-23 "82740 secs"
#> 9 2012-08-23 "82800 secs"
#> 10 2012-08-24 " 2040 secs"
#> # ℹ 1,356 more rows
db_summary4 |>
get_unique_attribute_values('when_paid')
#> # A tibble: 1,366 × 2
#> date hour
#> <chr> <chr>
#> 1 2012-08-23 "34800 secs"
#> 2 2012-08-23 "35400 secs"
#> 3 2012-08-23 "35460 secs"
#> 4 2012-08-23 "72720 secs"
#> 5 2012-08-23 "73140 secs"
#> 6 2012-08-23 "76800 secs"
#> 7 2012-08-23 "81720 secs"
#> 8 2012-08-23 "82740 secs"
#> 9 2012-08-23 "82800 secs"
#> 10 2012-08-24 " 2040 secs"
#> # ℹ 1,356 more rows
We do the same check as before on the number of instances of the tables and we observe that in this case the number of records in the fact table has also been reduced as a result of the aggregation carried out.
db_tl <- db_summary4 |>
as_tibble_list()
for (n in names) {
cat(sprintf("name: %s, %d rows\n", n, nrow(db_tl[[n]])))
}
#> name: transaction_summary, 984 rows
#> name: what, 9 rows
#> name: when, 1366 rows
#> name: when_paid, 1366 rows
#> name: where_chain, 33 rows
#> name: who_segment, 6 rows
sum(db_tl[['transaction_summary']]$transactions)
#> [1] 1000
To define a constellation we need at least two star databases. We consider the other star database from the starting example. We look at the instances of the what dimension.
db_finest |>
get_unique_attribute_values('what', col_as_vector = "As a vector")
#> # A tibble: 27 × 2
#> product `As a vector`
#> <chr> <chr>
#> 1 Additivum c('Additivum')
#> 2 Autokosmet. c('Autokosmet.')
#> 3 Autokozmetik c('Autokozmetik')
#> 4 Autoprísluš. c('Autoprísluš.')
#> 5 Dalnic.popl. c('Dalnic.popl.')
#> 6 Diesel c('Diesel')
#> 7 Diesel + c('Diesel +')
#> 8 Diesel aditiv c('Diesel aditiv')
#> 9 Knihy,nov. c('Knihy,nov.')
#> 10 LPG c('LPG')
#> # ℹ 17 more rows
Suppose some instances of this dimension are also modified, as shown below.
db_finest2 <- db_finest |>
replace_attribute_values(
name = 'what',
attributes = 'product',
old = c('Additivum', 'Autokozmetik'),
new = c('Other')
) |>
group_dimension_instances(name = 'what')
db_finest2 |>
get_unique_attribute_values('what', col_as_vector = "As a vector")
#> # A tibble: 26 × 2
#> product `As a vector`
#> <chr> <chr>
#> 1 Autokosmet. c('Autokosmet.')
#> 2 Autoprísluš. c('Autoprísluš.')
#> 3 Dalnic.popl. c('Dalnic.popl.')
#> 4 Diesel c('Diesel')
#> 5 Diesel + c('Diesel +')
#> 6 Diesel aditiv c('Diesel aditiv')
#> 7 Knihy,nov. c('Knihy,nov.')
#> 8 LPG c('LPG')
#> 9 Mytí vozidel c('Mytí vozidel')
#> 10 Nafta c('Nafta')
#> # ℹ 16 more rows
We define the constellation and as a result the dimensions are integrated.
ct <- constellation("CSS", db_summary4, db_finest2)
class(ct)
#> [1] "star_database"
names <- ct |>
get_table_names()
db_tl <- ct |>
as_tibble_list()
for (n in names) {
cat(sprintf("name: %s, %d rows\n", n, nrow(db_tl[[n]])))
}
#> name: transaction, 921 rows
#> name: transaction_line, 1000 rows
#> name: transaction_summary, 984 rows
#> name: what, 29 rows
#> name: when, 1965 rows
#> name: when_moment, 599 rows
#> name: when_paid, 1965 rows
#> name: when_processed, 1965 rows
#> name: where, 437 rows
#> name: where_chain, 33 rows
#> name: who, 902 rows
#> name: who_segment, 6 rows
For the dimension When, transformations were defined in only one of the star databases. The role-playing and role dimensions of the two star databases are also integrated. As a result, modifications are carried over from one star database to the other.
ct |>
get_role_playing_dimension_names()
#> $rpd_1
#> [1] "when" "when_paid" "when_processed"
ct |>
get_unique_attribute_values('when_processed')
#> # A tibble: 1,965 × 2
#> date hour
#> <chr> <chr>
#> 1 2012-08-23 "34800 secs"
#> 2 2012-08-23 "35400 secs"
#> 3 2012-08-23 "35460 secs"
#> 4 2012-08-23 "72720 secs"
#> 5 2012-08-23 "73140 secs"
#> 6 2012-08-23 "76800 secs"
#> 7 2012-08-23 "81720 secs"
#> 8 2012-08-23 "82740 secs"
#> 9 2012-08-23 "82800 secs"
#> 10 2012-08-24 " 2040 secs"
#> # ℹ 1,955 more rows
The result is shown for dimension when_processed, which is not part of the star database on which the modifications were defined but it is modified.
In the case of the what dimension, transformations have been carried out in both star databases: They are combined taking the new operations from one database to another.
ct |>
get_unique_attribute_values('what', col_as_vector = "As a vector")
#> # A tibble: 29 × 2
#> product `As a vector`
#> <chr> <chr>
#> 1 Additivum c('Additivum')
#> 2 Autokosmet. c('Autokosmet.')
#> 3 Autokozmetik c('Autokozmetik')
#> 4 Autoprísluš. c('Autoprísluš.')
#> 5 Dalnic.popl. c('Dalnic.popl.')
#> 6 Diesel c('Diesel')
#> 7 Diesel + c('Diesel +')
#> 8 Diesel aditiv c('Diesel aditiv')
#> 9 Knihy,nov. c('Knihy,nov.')
#> 10 LPG c('LPG')
#> # ℹ 19 more rows
The result obtained is explained because in star database db_summary the new operations of star database db_finest have been applied and vice versa. In the star database db_finest, the Autokosmet. to Autokozmetik transformation has been applied after the Autokozmetik to Other transformation, so the Autokozmetik value has been generated.
This situation has been sought in this example. When defining a constellation we must review the resulting dimensions. In any case, we can continue to make changes to the constellation, as shown below.
ct2 <- ct |>
replace_attribute_values(
name = 'what',
attributes = 'product',
old = c('Autokozmetik'),
new = c('Other')
) |>
group_dimension_instances(name = 'what')
#> Warning in res$name == name: longitud de objeto mayor no es múltiplo de la
#> longitud de uno menor
ct2 |>
get_unique_attribute_values('what', col_as_vector = "As a vector")
#> # A tibble: 28 × 2
#> product `As a vector`
#> <chr> <chr>
#> 1 Additivum c('Additivum')
#> 2 Autokosmet. c('Autokosmet.')
#> 3 Autoprísluš. c('Autoprísluš.')
#> 4 Dalnic.popl. c('Dalnic.popl.')
#> 5 Diesel c('Diesel')
#> 6 Diesel + c('Diesel +')
#> 7 Diesel aditiv c('Diesel aditiv')
#> 8 Knihy,nov. c('Knihy,nov.')
#> 9 LPG c('LPG')
#> 10 Mytí vozidel c('Mytí vozidel')
#> # ℹ 18 more rows
This document shows how to perform modification operations on dimension instances in both star databases and constellations. The modifications take into account the definition of role-playing and role dimensions both in a single database and in the integration of several to form a constellation.
In the case of performing operations on common dimensions of several star databases, the new operations are carried from one to another star database. If there is a conflict, we may need to define new operations on the constellation.
Although it is usually advisable to carry out the modification operations as close as possible to the data sources, this mechanism gives us flexibility especially when integrating several star databases into a constellation.