Getting Started with R Programming
Beginner's Guide | Essential Concepts
What is R?
R is an open-source programming language specifically designed for statistical computing, data analysis, and visualisation. It's widely used in academia, research, and industry for data science applications.
# Basic R syntax example
a <- 10
print(a)
# Output: [1] 10
Code Explanation:
- a <- 10 - Assigns the value 10 to variable 'a'
- print(a) - Outputs the value of 'a'
- The <- operator is R's assignment operator
- The [1] in the output is an index: it marks the position of the first element printed on that line
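To make the assignment and indexing points above concrete, here is a minimal sketch. The variable names and values (x, y, total, long) are made up for illustration and are not part of the original example.

```r
# Illustrative values only (assumed for this sketch)
x <- 10            # '<-' is the conventional assignment operator
y = 5              # '=' also assigns at the top level, but '<-' is preferred
total <- x + y
print(total)

# Even a single number is a vector of length 1, which is why R prints [1]
length(total)
is.vector(total)

# A longer vector wraps across console lines; each line starts with
# the index of its first element, e.g. [1], then a larger index
long <- 101:130
print(long)
```

Where the second line of the long vector starts depends on your console width, so the index shown at the start of that line may differ between sessions.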
Understanding the RStudio Interface
Development Environment | Workflow Optimisation
RStudio Panes Explained:
- Source Editor - Write and edit your R scripts and notebooks
- Console - Execute commands and see immediate results
- Environment/History - Track variables and command history
- Files/Plots/Packages/Help - Manage your workspace and documentation
# Example of working with data in R
# Load built-in dataset
data(cars)
# View first few rows
head(cars)
Understanding the Panes:
RStudio organises your workspace into these panes to streamline your data analysis workflow.
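After loading a dataset, a few base R functions give a quick overview in the Console, complementing what the Environment pane shows. The sketch below uses only the built-in cars dataset from the example above; the helper variable names (cars_dim, cars_names) are assumptions for illustration.

```r
# Load the built-in dataset used above
data(cars)

# Number of rows and columns
cars_dim <- dim(cars)
print(cars_dim)

# Column names
cars_names <- names(cars)
print(cars_names)

# Compact description of each column's type and first values
str(cars)

# Summary statistics (min, quartiles, mean, max) per column
summary(cars)
```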
Essential Data Structures in R
Variables | Vectors | Data Frames
Vectors
The most basic data structure in R - a sequence of data elements of the same type.
Data Frames
Tabular data structure similar to an Excel spreadsheet: each column is a vector of a single type, and all columns have the same length.
Lists
Flexible structures that can hold different data types.
# Create a numeric vector with c()
numbers <- c(1, 2, 3, 4, 5)
print(numbers)
# Output: [1] 1 2 3 4 5
Data Structure Types:
- Vector - Basic one-dimensional data structure
- Data Frame - Tabular data structure (like Excel)
- Matrix - Two-dimensional data structure
- List - Flexible structure that can hold different data types
- Factor - Used for categorical data
- Array - Multi-dimensional extension of matrices
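The other structures in the list above can be created with base R functions in much the same way as a vector. The following is a minimal sketch; every name and value in it (students, m, record, sizes, arr) is made up for illustration.

```r
# All values below are invented for illustration

# Data frame: columns are equal-length vectors, each with its own type
students <- data.frame(
  name  = c("Ana", "Ben", "Chloe"),
  score = c(88, 92, 79)
)

# Matrix: two dimensions, one data type, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)

# List: elements may have different types and lengths
record <- list(id = 1L, tags = c("a", "b"), passed = TRUE)

# Factor: categorical data with a fixed set of levels
sizes <- factor(c("small", "large", "small"), levels = c("small", "large"))

# Array: like a matrix, but with more than two dimensions
arr <- array(1:24, dim = c(2, 3, 4))

# Inspect each structure
str(students)
dim(m)
record$tags
levels(sizes)
dim(arr)
```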
Creating Beautiful Visualisations with ggplot2
ggplot2 | Plots | Charts | Graphics
Scatter Plots
Perfect for showing relationships between two continuous variables.
Bar Charts
Useful for comparing categorical data.
Histograms
Great for visualising data distributions.
# Load ggplot2 package
library(ggplot2)
# Create a scatter plot
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species)) +
  labs(title = "Iris Sepal Dimensions",
       x = "Sepal Length (cm)",
       y = "Sepal Width (cm)")
ggplot2 Code Breakdown:
- ggplot(data = iris) - Sets iris as the default dataset
- aes(x = Sepal.Length, y = Sepal.Width) - Defines aesthetic mappings
- geom_point() - Adds points to create scatter plot
- aes(color = Species) - Colours points by species category
- labs() - Adds labels and title to the plot
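The scatter plot is the only chart type shown in code above; bar charts and histograms follow the same pattern with a different geom. This is a sketch under assumptions: the built-in mtcars dataset, the binwidth of 0.25, and the object names bar_plot and hist_plot are choices made here for illustration, not part of the original material.

```r
library(ggplot2)

# Bar chart: geom_bar() counts the rows in each category.
# factor(cyl) treats the number of cylinders as categorical.
bar_plot <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(title = "Cars by Cylinder Count",
       x = "Cylinders",
       y = "Count")

# Histogram: geom_histogram() bins a continuous variable.
# binwidth = 0.25 is an arbitrary choice for this sketch.
hist_plot <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25) +
  labs(title = "Distribution of Sepal Length",
       x = "Sepal Length (cm)",
       y = "Count")

# Printing a ggplot object draws it
print(bar_plot)
print(hist_plot)
```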
Data Wrangling with dplyr
Manipulating Data | Select | Filter | Arrange | Mutate | Grouping
Using `dplyr` functions:
- select(): Choose specific columns
- filter(): Filter rows based on conditions
- arrange(): Sort data
- mutate(): Create new columns
- group_by(): Group data for summaries
# Example: select columns
library(dplyr)
starwars %>% select(name, height, mass)
# Filter example
starwars %>% filter(species == "Droid", height <= 100)
# Arrange example
starwars %>% arrange(desc(height))
# Mutate example
starwars %>% filter(species == "Human") %>%
  mutate(bmi = mass / (height / 100)^2)
# Group and summarize
starwars %>% group_by(species) %>%
  summarise(avg_height = mean(height, na.rm = TRUE),
            avg_mass = mean(mass, na.rm = TRUE))
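The verbs above are usually chained into a single pipeline. The sketch below shows one possible chain on a small, made-up data frame (pets and all of its values are hypothetical), so that each step is easy to check by hand.

```r
library(dplyr)

# Hypothetical data, invented for this sketch
pets <- data.frame(
  name    = c("Rex", "Milo", "Luna", "Coco", "Bella"),
  species = c("dog", "cat", "cat", "dog", "dog"),
  weight  = c(30, 4, 5, 12, 25),  # kg
  age     = c(5, 3, 7, 2, 9)      # years
)

pet_summary <- pets %>%
  filter(age >= 3) %>%                    # keep pets aged 3 or older
  mutate(is_senior = age >= 7) %>%        # add a logical column
  group_by(species) %>%                   # one group per species
  summarise(n = n(),                      # rows per group
            n_senior = sum(is_senior),    # seniors per group
            avg_weight = mean(weight)) %>%
  arrange(desc(avg_weight))               # heaviest group first

print(pet_summary)
```

Each verb receives the result of the previous one, so the steps read top to bottom in the order they are applied.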
Hypothesis Testing with T-tests
Comparing Groups | p-value | Statistical Significance
Example: Sleep Study
Suppose we compare the extra hours of sleep that patients gained under two drugs, using a t-test.
# Example data: the built-in 'sleep' dataset (columns: extra, group, ID)
sleep
# Boxplot
library(ggplot2)
ggplot(sleep, aes(x=group, y=extra)) + geom_boxplot()
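# T-test (independent samples; R runs Welch's t-test by default)
t.test(extra ~ group, data = sleep)
# Paired t-test (each patient received both drugs; rows are ordered by patient ID)
# Note: the formula interface rejects paired = TRUE in R >= 4.4.0, so pass two vectors
with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))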
Understanding p-values:
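The p-value is the probability of seeing a difference at least as large as the observed one if the null hypothesis (no difference between groups) were true. A small p-value (conventionally < 0.05) counts as evidence against the null hypothesis; it does not measure how large or important the difference is.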
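The result of t.test() is an object whose components can be read directly, which is handy for comparing the p-value to a threshold in code. This is a minimal sketch on the built-in sleep data; the names welch, paired, and alpha are assumptions chosen here, and 0.05 is just the conventional cut-off mentioned above.

```r
# Independent-samples (Welch) t-test
welch <- t.test(extra ~ group, data = sleep)

# Paired t-test, passing the two groups as vectors
paired <- with(sleep, t.test(extra[group == 1], extra[group == 2],
                             paired = TRUE))

alpha <- 0.05  # conventional significance threshold

# Extract the p-values from the result objects
welch$p.value
paired$p.value

# Compare each p-value to the threshold
welch$p.value < alpha
paired$p.value < alpha

# A confidence interval for the mean difference is also available
paired$conf.int
```

On this dataset the two tests can land on opposite sides of the 0.05 threshold, because pairing removes the variability between patients. Choosing the test that matches the study design matters more than the threshold itself.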
Permutation Test (Optional Advanced)
Resampling Methods | Null Hypothesis | Significance
What is a permutation test?
It's a non-parametric method for testing whether two groups differ: the group labels are shuffled at random many times to build the distribution of differences expected under the null hypothesis.
# Pool data (mosquito_preference is not a built-in dataset; it must already be loaded)
pooled <- mosquito_preference$num_mosquitoes
# Shuffle data
shuffled <- sample(pooled)
# Split into fake groups
fake_group1 <- head(shuffled, 25)
fake_group2 <- tail(shuffled, 18)
# Calculate difference
fake_diff <- mean(fake_group1) - mean(fake_group2)
# Repeat many times
fake_diffs <- numeric(10000)
for (i in 1:10000) {
  shuffled <- sample(pooled)
  fake_group1 <- head(shuffled, 25)
  fake_group2 <- tail(shuffled, 18)
  fake_diffs[i] <- mean(fake_group1) - mean(fake_group2)
}
# Plot histogram
library(ggplot2)
ggplot(data.frame(fake_diffs = fake_diffs), aes(x = fake_diffs)) +
  geom_histogram(bins = 30) +
  labs(title = "Permutation Distribution of Difference", x = "Difference", y = "Frequency")
Interpreting results:
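Compare your observed difference to this distribution to assess significance: the p-value is the proportion of shuffled differences that are at least as extreme as the observed one.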
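The comparison described above can be sketched end to end as follows. Because the original dataset is not included here, the two groups are simulated: group1, group2, the Poisson means, and the seed are all assumptions made for illustration, with group sizes of 25 and 18 to match the code above. A two-sided p-value is used; a one-sided version would compare the raw differences instead of their absolute values.

```r
set.seed(42)  # make the random shuffles reproducible

# Simulated stand-in data (hypothetical, not the original dataset)
group1 <- rpois(25, lambda = 23)
group2 <- rpois(18, lambda = 19)

# Observed difference in group means
observed_diff <- mean(group1) - mean(group2)

# Pool the data and remember the size of the first group
pooled <- c(group1, group2)
n1 <- length(group1)

# Build the permutation distribution: shuffle, split, take the difference
n_perm <- 10000
fake_diffs <- replicate(n_perm, {
  shuffled <- sample(pooled)
  mean(shuffled[1:n1]) - mean(shuffled[-(1:n1)])
})

# Two-sided p-value: share of shuffled differences at least as
# extreme (in absolute value) as the observed difference
p_value <- mean(abs(fake_diffs) >= abs(observed_diff))

observed_diff
p_value
```

With real data, replace the simulated group1 and group2 with the measured values for each group; the rest of the sketch stays the same.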