Motivation

We wanted a frictionless way to view summary statistics quickly as part of a pipeline. Many summary functions already exist, but we found each of them lacking in some way: they can be too generic, they don’t always return data structures that are easy to operate on, and they are not pipeable.

So at rOpenSci #unconf17, we set out to build a new package that would let you quickly skim useful, tidy summary statistics directly from a pipe.

And so we created skimr.

In a nutshell, skimr creates a skimr object that can either be operated on further or printed as a human-readable summary in the console. It provides reasonable default summary statistics for numerics, factors, etc., and lists counts as well as missing and unique values.

Dev process

About Team Members

Amelia McNamara
Job Title: Visiting Assistant Professor of Statistical & Data Sciences at Smith College
Project Contributions: Coder

Eduardo Arino de la Rubia
Job Title: Chief Data Scientist at Domino Data Lab
Project Contributions: Coder

Hao Zhu
Job Title: Programmer Analyst at the Institute for Aging Research
Project Contributions: Coder

Julia Lowndes
Job Title: Marine Data Scientist at the National Center for Ecological Analysis and Synthesis
Project Contributions: Documentation and test scripts

Shannon Ellis
Job Title: Postdoctoral fellow in the Biostatistics Department at the Johns Hopkins Bloomberg School of Public Health
Project Contributions: Test scripts

Elin Waring
Job Title: Professor at Lehman College Sociology Department, City University of New York
Project Contributions: Coder

Michael Quinn
Job Title: Quantitative Analyst at Google
Project Contributions: Coder

Hope McLeod
Job Title: Data Engineer at Kobalt Music
Project Contributions: Documentation

  • how we pulled it all together
  • next steps/future (issues for later)

Step-by-step thought process

We started by brainstorming what we liked about existing summary packages and which other features we wanted. We began with an example dataset, mtcars.

str(mtcars)
summary(mtcars)

What we like and dislike, in Amelia’s words

# "I like what we get here because mpg is numeric so these stats make sense:" 
summary(mtcars$mpg) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.42   19.20   20.09   22.80   33.90
# "But I don’t like this because cyl should really be a factor and shouldn't have these stats:"
summary(mtcars$cyl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   4.000   6.000   6.188   8.000   8.000
# "This is OK, but not descriptive enough. It could be clearer what I'm looking at."
mosaic::tally(~cyl, data=mtcars) # install.packages('mosaic')
## cyl
##  4  6  8 
## 11  7 14
# "But this output isn't labeled, not ideal." 
table(mtcars$cyl, mtcars$vs)
##    
##      0  1
##   4  1 10
##   6  3  4
##   8 14  0
# "I like this because it returns 'sd', 'n' and 'missing'":
mosaic::favstats(~mpg, data=mtcars) 
##   min     Q1 median   Q3  max     mean       sd  n missing
##  10.4 15.425   19.2 22.8 33.9 20.09062 6.026948 32       0
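
Two of the complaints above can be worked around in base R, which helps show what we were missing. The following is a minimal sketch, not part of skimr: it converts cyl to a factor before summarizing, and it names the arguments passed to table() so the output is labeled. The variable names cyl_counts and labeled are ours, chosen for illustration only.

```r
# Illustrative base R sketch; `cyl_counts` and `labeled` are hypothetical names.

# Treating cyl as a factor makes summary() return a count per level
# instead of quartiles and a mean.
cyl_counts <- summary(factor(mtcars$cyl))
print(cyl_counts)

# Naming the arguments to table() labels both dimensions of the output.
labeled <- table(cyl = mtcars$cyl, vs = mtcars$vs)
print(labeled)
```

Both workarounds still require knowing ahead of time which columns are categorical, and they return a named vector and a table object rather than a data frame that is easy to pass along a pipe.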

Introducing skimr