This section covers an important strategy when you are cleaning and prepping your data, which is to aim to create tidy data. This is a specific approach for establishing how data should be structured before you work with it. It is a concept/approach that is embraced by the godfather of R, Hadley Wickham, and, as such, is supported by his collection of packages (the so-called “tidyverse”).
There are a great many resources on tidy data that will explain it better than we can here, so we will just give it a brief summary and leave some links for further reading. But, the main point is that we embrace its philosophy, and ,as a data analyst, it is wise to aim for tidy data after you have imported it.
“Tidy datasets are all alike, but every messy dataset is messy in its own way” - Hadley Wickham
The aim of a tidy dataset is to present it in a manner that further processing can be done in an organised fashion:
If you are a heavy user of Excel (and, specifically, Excel pivot tables), then you can think about tidy data as being data that is extremely “pivot-friendly.” Think about a time when you knew you needed a pivot table, but the raw data had dimensions both in rows and columns (e.g., Campaign in rows and Device Category in columns, with Sessions in the cells). If you found yourself doing some manual or formula-heavy transformation of the raw data just to get it to something you wanted to use to feed a pivot table, then you’ve experienced non-tidy data.
Tidy data isn’t necessarily data that is easiest to read by a human, but it is well suited for passing into further R functions, especially those within the tidyverse
.
Two bits of good news about tidy data:
tidyr
package) specifically for converting data to/from a tidy format.Tidyverse
is the new name for the Hadleyverse
, which is a collection of R packages created or supported by Hadley Wickham.
These packages aim to cover the whole spectrum of data analysis within R, each supporting the other in concepts and output. The Big Three packages within the tidyverse are:
dplyr
- data manipulation of data framestidyr
- tools to tidy (and untidy) data framesggplot2
- plotting tidy dataTip: Note that the three packages above have pretty unique names. That makes them handy to include in Google searches when you are looking for details on how to perform a specific operation or troubleshoot a specific issue (e.g., “filter data by column name with dplyr”). Including one of these package names virtually guarantees that the search results will all be R-specific.
In addition, some other packages in the tidyverse that you may come across are:
tibble
- creating more user-friendly data.framespurrr
- more general data manipulation toolsmagrittr
- the origin of %>%
, extensivly used in the tidyverse (This isn’t actually a package developed by Wickham, but his packages have a dependency on this one; dropping, “Oh, yeah. The pipe in dplyr
actually comes from the magrittr
package” in casual conversation with R folk will provide a nominal degree of street cred.)broom
- turn statistical models into tidy data frames/tiddlesThis list is just a subset of the tidyverse packages, but, the tidyverse has grown so much in popularity that you can now simply install/load the ‘tidyverse’ package, which will install all of these packages (and a few more) all at once.
Whilst you can achieve what the above packages do in base R or with other packages (data.table()
is a noteable alternative to dplyr()
), if you embrace the tidyverse
, then you will need to become familiar with some, if not all, of the packages above.