Reading Excel & SPSS Files in R | Tips & Tricks
A Researcher’s Love for Excel and SPSS
Working for the research department at the municipality of Rotterdam (OBI), I discovered their love for Excel spreadsheets and SPSS documents. Hundreds, no, thousands of them, often with another hundred sheets each. Sometimes I needed to join sheets, or bind columns and rows, which is tedious in Excel itself. I started scouring packages and functions to find what I needed. In this post, you’ll find some tips and tricks that helped me speed things up.
Excel
Let’s first get an overview of useful available packages:
readxl
readxl, the tidyverse Excel reading package, has some incredibly useful features and loads all the data immediately into the succinct tbl format. You can list sheet names with excel_sheets(), and specify ranges to choose how many rows or columns to load with read_excel(path, range = cell_rows(1:10)).
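As a quick sketch of these two features (the file path is a placeholder, adjust it to your own workbook):

```r
library(readxl)

# Hypothetical workbook path
path <- "~/Your/File/Location/dataset.xls"

# List the sheet names without loading any data
excel_sheets(path)

# Load only the first ten rows of the second sheet
top_rows <- read_excel(path, sheet = 2, range = cell_rows(1:10))
```

The range argument also accepts Excel-style references such as "B3:D10" if you need specific columns as well as rows.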
Combining these functions is extremely useful for loading all sheets of a file into your global environment with the names intact. See the function below.
All Excel sheets into Global Environment with original names
# Read all Excel sheets at once
read_excel_allsheets <- function(filename) {
  sheets <- readxl::excel_sheets(filename)
  x <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
  names(x) <- sheets  # Keep the original sheet names
  x
}

# Create a named list of tibbles
MyData <- read_excel_allsheets("~/Your/File/Location/dataset.xls")

# Unlist all tibbles into the global environment
list2env(MyData, .GlobalEnv)
data.table
data.table's fread() and fwrite() functions are fast. Really fast. Which is incredibly useful if you have large files. Recently I came across a file that was incredibly large. Too large for my laptop’s RAM to handle. There are many ways to increase your computing power, like Hadoop or, more recently, Spark; sparklyr by RStudio is a good and useful alternative.
However, I did not have that luxury, considering I was working on a remote access server. I turned my attention towards data.table. I did not need the entire file, only a subset, filtered by the contents of another file. This was my approach:
library(data.table)
library(dplyr)

#-------------------------------------------
# Loop to filter the large dataset in parts
#-------------------------------------------
FilterData <- fread("Path/To/FilterData.csv")

# Since we split the .csv file, chunks after the first have no header row.
# Specify your own colnames (most of the time this is better anyway).
ColNames <- c("colname1", "colname2", "etc.")

# Find out the total number of rows in your file
No_Rows <- "total number of rows"  # replace with the actual row count

# Rows per chunk: here 10 folds; change the divisor to split it more ways
x <- round(No_Rows / 10, 0)

for (i in 1:10) {
  # Skip the rows already processed, plus 1 for the original header line
  y <- (i - 1) * x + 1

  df <- fread("Path/To/LargeDataFile.csv",
              nrows = x, skip = y, header = FALSE, col.names = ColNames)

  # Filter the chunk and write the part away
  df %>%
    inner_join(y = FilterData, by = "FilterColumnName") %>%
    fwrite(file = paste0("Path/To/Output/Location/", "Part_", i, ".csv"))

  # Remove the chunk to make space for the next one
  rm(df)
}
#--------------------------------------------------------------
# Binding the dataset back together and writing the final file
#--------------------------------------------------------------
## Create a list of the files in the folder
FileNames <- list.files(path = "Path/To/Output/Location",
                        pattern = "\\.csv$", full.names = TRUE)

## Read all parts into a list of data.tables
Parts <- lapply(FileNames, fread)

## Bind the data.tables together
Complete <- do.call(rbind, Parts)

## Write the output
fwrite(Complete, file = "Path/To/Output/FinalDataset.csv")
SPSS
haven
haven is another package by Hadley Wickham & co., which is also part of the tidyverse. Just like readxl, this package converts the loaded files into the tibble format. Unlike many other SPSS reading packages, haven can also write files back into an SPSS format with write_sav(). This might come in handy if your researchers have only ever heard of R as a letter in the alphabet. Besides .sav, haven also reads .sas7bdat (SAS) and .dta (Stata) files.
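A minimal round-trip sketch with haven (the file paths are placeholders):

```r
library(haven)

# Read an SPSS file into a tibble; labelled SPSS values arrive
# as haven's labelled vectors
survey <- read_sav("Path/To/Survey.sav")

# Optionally convert the labelled columns to regular R factors
survey <- as_factor(survey)

# ... clean or analyse the data ...

# Write the result back to SPSS format for your colleagues
write_sav(survey, "Path/To/Survey_edited.sav")
```

The as_factor() step is optional: keep the labelled vectors if you want to preserve the original SPSS value codes alongside their labels.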
Although haven is great, I had a similar issue when I came across a file of 20 GB, with approximately 170 million rows. This led me to the package:
memisc
memisc is a package created for the management of survey data and the presentation of analysis results. memisc can find information about an SPSS file without having to load the complete file, which can be very useful.
For instance, SPSSFile <- spss.system.file("Path/To/File.sav")
creates an importer object. Calling print(), or just typing the object name, gives us this:
##
## SPSS system file 'C:/Users/Stefan/Documents/Projects/Personal-Website/static/data/Excel_SPSS/testdata.sav'
## with 17 variables and 5 observations
While the function description()
gives us this:
Variable | Label
---|---
numeric | numeric variable
numeric_long_label | numeric variable with long label: this variable has five observations (one is missing). All values between 1 and 2 are also declared as missing. We use two decimal places and the measurement level is “Scale”.
factor_numeric | numeric factor with missing range
factor_n_long_value_label | numeric factor with long value labels
factor_n_coded_miss | numeric factor with coded missing values
factor_n_duplicated | numeric factor with duplicated factor labels
factor_n_undeclared | numeric factor with undeclared values
factor_n_undeclared2 | numeric factor with undeclared values
string | string variable
string_500 | long string variable
strin0 | long string variable
string_miss | string factor with missings
factor_s_coded_miss | string factor with coded missing
factor_s_duplicated | string factor with duplicated factor labels
factor_s_undeclared | string factor with undeclared values
factor_s_undeclared2 | string factor with undeclared values
date | date format tt.mm.yyyy
If we know which columns we need from the SPSS file, we can specify this with the subset() function, e.g.:
df <- as.data.table(subset(SPSSFile,
                           select = c("colname1",
                                      "colname2",
                                      "colname7",
                                      "etc")))
memisc
has a lot more functionalities, so be sure to check out the documentation.
Roundup
So, before you immediately start reading and writing data with the same package every time, think about the structure, the size, and what it is that you actually need. Then choose the most suitable package, and start reading away!