Every data analysis project begins with a simple question: what does the data look like? To answer it, we typically start with descriptive statistics, such as the mean, median, standard deviation, and variance, to get an initial overview of the dataset. In R (for example, within RStudio), these summary statistics can be generated quickly with base R or with packages such as dplyr and data.table. To illustrate these approaches, let’s start by defining a small data frame that we’ll use throughout this post:
data <- data.frame(
  score = c(88, 92, 79, 85, 90),
  group = as.factor(c("A", "A", "B", "B", "B"))
)
Here’s what the data frame looks like:
  score group
1    88     A
2    92     A
3    79     B
4    85     B
5    90     B
This data frame contains five rows and two columns: a numerical column score and a categorical column group.
1. Summarizing data with base R
With base R, we can obtain descriptive statistics for a data frame using the summary() function as follows:
summary(data)

     score      group
 Min.   :79.0   A:2
 1st Qu.:85.0   B:3
 Median :88.0
 Mean   :86.8
 3rd Qu.:90.0
 Max.   :92.0
When applied to a numerical column (e.g. score), the summary() function returns the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. When applied to a categorical column (e.g. group), it returns the count for each category, provided the column is explicitly defined as a factor (as we have done for this data frame). If the column is stored as plain character strings instead, R displays only the length, class, and mode of the column rather than the frequency of each category.
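To make the factor versus character difference concrete, here is a minimal sketch in base R. The variable names group_factor and group_char are chosen only for this illustration and do not appear elsewhere in the post.

```r
# The same five labels stored two ways
group_factor <- factor(c("A", "A", "B", "B", "B"))
group_char <- c("A", "A", "B", "B", "B")

# Factor: summary() reports the count for each level
summary(group_factor)

# Character: summary() reports only Length, Class, and Mode
summary(group_char)

# table() counts the categories regardless of how the column is stored
counts <- table(group_char)
counts
```

If a column was read in as character, wrapping it in factor() before calling summary(), or calling table() on it directly, is one way to get the per-category counts.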
In addition to overall summaries, it is often helpful to compute the summary statistics separately for each group. Using the tapply() function, we can apply functions such as mean, sd, sum, or length to each subgroup of data. For example, we can apply the tapply() function to our data frame to calculate the mean and standard deviation separately for each group:
tapply(
  X = data$score,
  INDEX = data$group,
  FUN = function(x) {
    c(mean = mean(x), sd = sd(x))
  }
)
The tapply() function takes the values in X, splits them into subgroups defined by the categories in INDEX, and applies FUN to each subgroup. Because FUN here returns two values per group, the results come back as a list with one named element per group. The output is shown below:
$A
   mean      sd
90.0000  2.8284

$B
   mean      sd
84.6667  5.5076
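The split-then-apply logic that tapply() performs can also be written out step by step. The sketch below uses the base R functions split() and sapply() to make each stage visible. It is an illustrative equivalent, not how tapply() is implemented internally, and the names by_group and stats are chosen only for this example.

```r
data <- data.frame(
  score = c(88, 92, 79, 85, 90),
  group = as.factor(c("A", "A", "B", "B", "B"))
)

# Step 1: split the scores into one numeric vector per group
by_group <- split(data$score, data$group)

# Step 2: apply the summary function to each vector.
# sapply() combines the results into a matrix with
# statistics in rows and groups in columns.
stats <- sapply(by_group, function(x) c(mean = mean(x), sd = sd(x)))

round(stats, 2)
```

The matrix form can be handier than the list returned above when you want to compare groups side by side or pass the results to other functions.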
2. Summarizing data with the dplyr package
The dplyr package is an R package designed for efficient and user-friendly data manipulation, providing a set of functions for selecting, filtering, arranging, and summarizing the data. To calculate descriptive statistics for a numerical variable using the dplyr package, we can use the summarise() function as follows:
library(dplyr)
data %>%
  summarise(
    mean_score = mean(score),
    min_score = min(score),
    max_score = max(score),
    sd_score = sd(score)
  )
This produces a table containing descriptive statistics for the score column:
  mean_score min_score max_score sd_score
1       86.8        79        92 5.069517
To compute summary statistics separately for each group, we can use the summarise() function together with the group_by() function:
data %>%
  group_by(group) %>%
  summarise(
    mean_score = mean(score),
    min_score = min(score),
    max_score = max(score),
    sd_score = sd(score),
    .groups = "drop"
  )
The output now displays summary statistics separately for groups A and B:
# A tibble: 2 × 5
  group mean_score min_score max_score sd_score
  <fct>      <dbl>     <dbl>     <dbl>    <dbl>
1 A           90          88        92     2.83
2 B           84.7        79        90     5.51
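The same group_by() and summarise() pattern extends to other statistics. As a hedged sketch, the example below adds the group size with dplyr's n() helper, along with the median and variance mentioned at the start of this post. The column names n, median_score, and var_score and the object name group_summary are chosen only for this illustration.

```r
library(dplyr)

data <- data.frame(
  score = c(88, 92, 79, 85, 90),
  group = as.factor(c("A", "A", "B", "B", "B"))
)

group_summary <- data %>%
  group_by(group) %>%
  summarise(
    n = n(),                       # number of rows in each group
    median_score = median(score),  # middle value within each group
    var_score = var(score),        # sample variance within each group
    .groups = "drop"
  )

group_summary
```

Reporting n next to each statistic is useful with small groups like these, because a mean or standard deviation based on two or three observations should be read with caution.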
Compared to base R, many users find dplyr’s syntax more readable: each function does one job, and the pipe chains the steps in the order they are performed. Tasks that need several lines of code or explicit indexing in base R can often be expressed as a single pipeline. In addition, dplyr is designed to perform well on larger datasets and integrates seamlessly with other tidyverse packages.
3. Summarizing data with the data.table package
The data.table package is known for its speed and concise syntax, making it especially useful when working with large datasets. It extends the functionality of data frames by introducing a powerful three-part notation [i, j, by], where i specifies rows, j specifies columns or calculations, and by defines groups for aggregation. This compact structure allows subsetting, transforming, and summarizing data in a single step. For example, the following code creates a data.table from our dataset and computes the mean, minimum, maximum, and standard deviation of score separately for each group:
library(data.table)
dt <- as.data.table(data)
dt[, .(
  mean_score = mean(score),
  min_score = min(score),
  max_score = max(score),
  sd_score = sd(score)
), by = group]
The result is a table with summary statistics calculated separately for each group:
   group mean_score min_score max_score sd_score
1:     A       90.0        88        92     2.83
2:     B       84.7        79        90     5.51
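The example above leaves the i position empty, so every row is used. As a minimal sketch of all three parts of the [i, j, by] notation working together, the code below first filters rows in i, then counts and averages in j, grouped by group. The threshold of 80 and the name high_scores are arbitrary choices made only for this illustration; .N is data.table's built-in symbol for the number of rows in each group.

```r
library(data.table)

dt <- data.table(
  score = c(88, 92, 79, 85, 90),
  group = factor(c("A", "A", "B", "B", "B"))
)

# i:  keep only rows with a score above 80 (illustrative threshold)
# j:  count the remaining rows (.N) and average their scores
# by: do this separately for each group
high_scores <- dt[score > 80, .(n = .N, mean_score = mean(score)), by = group]

high_scores
```

Because the filter runs before the aggregation, group B's mean here is based on two scores (85 and 90) rather than all three.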
