R Programming Guide - Modern Blog

Getting Started with R Programming

Beginner's Guide | Essential Concepts

What is R?

R is an open-source programming language specifically designed for statistical computing, data analysis, and visualisation. It's widely used in academia, research, and industry for data science applications.

# Basic R syntax example
a <- 10
print(a)
# Output: [1] 10

Code Explanation:

a <- 10 - Assigns the value 10 to variable 'a'
print(a) - Outputs the value of 'a'
The <- operator is R's assignment operator
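The [1] in the output is an index, not part of the value: it marks the position of the first element printed on that line, because even a single number is a vector of length 1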
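To make the role of [1] more concrete, here is a small sketch that prints a longer vector. The names `b` and `long_vector` are illustrative and not part of the original example.

```r
# A single number is a vector of length 1
a <- 10
length(a)   # number of elements in a
a[1]        # first (and only) element of a

# `=` also assigns at the top level, but `<-` is the conventional choice
b = 20

# A longer vector wraps across lines in the console; each printed line
# starts with the index of its first element, beginning with [1]
long_vector <- 101:140
print(long_vector)
```

How many values fit on each line depends on the width of your console, so the index shown at the start of the second line will vary.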

Understanding the RStudio Interface

Development Environment | Workflow Optimisation

RStudio Panes Explained:

Source Editor - Write and edit your R scripts and notebooks
Console - Execute commands and see immediate results
Environment/History - Track variables and command history
Files/Plots/Packages/Help - Manage your workspace and documentation

# Example of working with data in R
# Load built-in dataset
data(cars)

# View first few rows
head(cars)

Understanding the Panes:

RStudio organises your workspace into these panes to streamline your data analysis workflow.
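As a rough illustration of how the panes work together, the commands below can be typed into the Console or run from a script in the Source Editor. The comments describe where RStudio would normally show each result; this is a sketch of typical behaviour rather than a complete tour.

```r
# Load the built-in dataset; it is then listed in the Environment pane
data(cars)

# Structure and summary statistics are printed in the Console
str(cars)
summary(cars)

# Base R plots are drawn in the Plots pane
plot(cars)
```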

Essential Data Structures in R

Variables | Vectors | Data Frames

Vectors

The most basic data structure in R - a sequence of data elements of the same type.

Data Frames

A tabular structure similar to a spreadsheet: each column is a vector of a single type, and all columns have the same length.

Lists

Flexible structures that can hold different data types.

# Creating a vector with c()
numbers <- c(1, 2, 3, 4, 5)
print(numbers)
# Output: [1] 1 2 3 4 5

Data Structure Types:

Vector - Basic one-dimensional data structure
Data Frame - Tabular data structure (like Excel)
Matrix - Two-dimensional data structure whose elements all share one type
List - Flexible structure that can hold different data types
Factor - Used for categorical data
Array - Multi-dimensional extension of matrices
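The other structures in this list can be created in much the same way as a vector. The following is a minimal sketch using base R only; all names and values (`people`, `m`, `record`, `sizes`, `arr`) are made up for illustration.

```r
# Data frame: columns of equal length, each with its own type
people <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  age  = c(31, 45, 27)
)
str(people)

# Matrix: two dimensions, one type for all elements
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)          # 2 rows, 3 columns

# List: elements may differ in type and length
record <- list(id = 1L, tags = c("a", "b"), active = TRUE)
record$tags     # access an element by name

# Factor: categorical values with a fixed set of levels
sizes <- factor(c("small", "large", "small"), levels = c("small", "large"))
levels(sizes)

# Array: like a matrix, but with more than two dimensions
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)
```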

Creating Beautiful Visualisations with ggplot2

ggplot2 | Plots | Charts | Graphics

Scatter Plots

Perfect for showing relationships between two continuous variables.

Bar Charts

Useful for comparing categorical data.

Histograms

Great for visualising data distributions.

# Load ggplot2 package
library(ggplot2)

# Create a scatter plot
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species)) +
  labs(title = "Iris Sepal Dimensions",
       x = "Sepal Length (cm)",
       y = "Sepal Width (cm)")

ggplot2 Code Breakdown:

ggplot(data = iris) - Sets iris as the default dataset
aes(x = Sepal.Length, y = Sepal.Width) - Defines aesthetic mappings
geom_point() - Adds points to create scatter plot
aes(color = Species) - Colours points by species category
labs() - Adds labels and title to the plot
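The bar chart and histogram described earlier follow the same pattern as the scatter plot. The sketch below reuses the iris dataset; the titles, the object names `bar_chart` and `histogram`, and the choice of 20 bins are illustrative assumptions.

```r
library(ggplot2)

# Bar chart: count the rows in each category of a discrete variable
bar_chart <- ggplot(iris, aes(x = Species)) +
  geom_bar() +
  labs(title = "Number of Flowers per Species", x = "Species", y = "Count")

# Histogram: distribution of one continuous variable
histogram <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 20) +
  labs(title = "Distribution of Sepal Length", x = "Sepal Length (cm)", y = "Count")

# Printing a plot object draws it (in RStudio, in the Plots pane)
print(bar_chart)
print(histogram)
```

Storing a plot in a variable is optional, but it makes it easy to add further layers or save the plot later.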

Data Wrangling with dplyr

Using `dplyr` functions:

select(): Choose specific columns
filter(): Filter rows based on conditions
arrange(): Sort data
mutate(): Create new columns
group_by(): Group data for summaries

# Example: select columns
library(dplyr)
starwars %>% select(name, height, mass)

# Filter example
starwars %>% filter(species == "Droid", height <= 100)

# Arrange example
starwars %>% arrange(desc(height))

# Mutate example
starwars %>% filter(species == "Human") %>% 
  mutate(bmi = mass / (height/100)^2)

# Group and summarise
starwars %>%
  group_by(species) %>%
  summarise(avg_height = mean(height, na.rm = TRUE),
            avg_mass = mean(mass, na.rm = TRUE))
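In practice these verbs are usually chained into one pipeline. As a sketch, the example below combines several of them on the same starwars data; the result name `tallest_species` and the cut-off of two characters per species are arbitrary choices for illustration.

```r
library(dplyr)

tallest_species <- starwars %>%
  filter(!is.na(height), !is.na(species)) %>%   # drop rows with missing values
  group_by(species) %>%
  summarise(n = n(), avg_height = mean(height)) %>%
  filter(n >= 2) %>%                            # keep species with 2+ characters
  arrange(desc(avg_height))                     # tallest species first

tallest_species
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations are applied.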

Hypothesis Testing with T-tests

Comparing Groups | p-value | Statistical Significance

Example: Sleep Study

R's built-in sleep dataset records the extra hours of sleep that 10 patients gained under each of two drugs. We can compare the two drugs with a t-test.

# Example data
sleep
# Boxplot
library(ggplot2)
ggplot(sleep, aes(x=group, y=extra)) + geom_boxplot()

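# Independent two-sample t-test (Welch's test by default)
t.test(extra ~ group, data = sleep)

# Paired t-test: each patient received both drugs.
# Recent R versions (4.4.0 onward) reject paired = TRUE in the formula
# interface, so pass the two groups as vectors (rows are ordered by patient ID)
with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))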

Understanding p-values:

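A p-value is the probability of seeing a difference at least as large as the one observed if the null hypothesis (no difference between groups) were true. A small p-value (conventionally below 0.05) counts as evidence against the null hypothesis; it does not measure how large or important the difference is.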
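To work with the p-value directly rather than reading it off the printed output, you can store the test result and extract it. The sketch below uses the sleep data again; the names `welch`, `paired`, and `alpha` are illustrative.

```r
# Welch two-sample t-test on the built-in sleep data
welch <- t.test(extra ~ group, data = sleep)

# Paired version: the same 10 patients were measured under both drugs
paired <- with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))

# t.test() returns a list-like object; p.value holds the p-value
welch$p.value    # about 0.079
paired$p.value   # about 0.0028

alpha <- 0.05    # conventional threshold, chosen before looking at the data
welch$p.value < alpha
paired$p.value < alpha
```

The two tests disagree at the 0.05 level, which shows why it matters to choose the test that matches the study design: here the data are paired, so the paired test is the appropriate one.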

Permutation Test (Optional Advanced)

Resampling Methods | Null Hypothesis | Significance

What is a permutation test?

It's a non-parametric method to test if two groups differ significantly by randomly shuffling data labels.

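# Pool data (assumes the mosquito_preference data frame is already loaded
# and that num_mosquitoes holds the counts for both groups)
pooled <- mosquito_preference$num_mosquitoes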

# Shuffle data
shuffled <- sample(pooled)

# Split into fake groups
fake_group1 <- head(shuffled, 25)
fake_group2 <- tail(shuffled, 18)

# Calculate difference
fake_diff <- mean(fake_group1) - mean(fake_group2)

# Repeat many times
fake_diffs <- numeric(10000)
for(i in 1:10000){
  shuffled <- sample(pooled)
  fake_group1 <- head(shuffled, 25)
  fake_group2 <- tail(shuffled, 18)
  fake_diffs[i] <- mean(fake_group1) - mean(fake_group2)
}
# Plot histogram of the shuffled differences
library(ggplot2)
ggplot(data.frame(fake_diffs = fake_diffs), aes(x = fake_diffs)) +
  geom_histogram(bins = 30) +
  labs(title = "Permutation Distribution of Difference", x = "Difference", y = "Frequency")

Interpreting results:

Compare your observed difference to this distribution: the proportion of shuffled differences that are at least as extreme as the observed one estimates the p-value.
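That comparison can be written in a few lines. Because mosquito_preference does not ship with R, the sketch below swaps in the built-in mtcars data (fuel economy for the two transmission types) purely as a stand-in; the variable names and the seed are assumptions made for illustration. It shuffles the group labels rather than the values, which is equivalent to the shuffling used above.

```r
set.seed(42)  # make the shuffles reproducible

# Stand-in data: miles per gallon, grouped by transmission type (am = 0 or 1)
values <- mtcars$mpg
labels <- mtcars$am

# Observed difference in group means
observed_diff <- mean(values[labels == 1]) - mean(values[labels == 0])

# Shuffle the labels many times and recompute the difference each time
perm_diffs <- replicate(10000, {
  shuffled <- sample(labels)
  mean(values[shuffled == 1]) - mean(values[shuffled == 0])
})

# Two-sided p-value: share of shuffles at least as extreme as observed
p_value <- mean(abs(perm_diffs) >= abs(observed_diff))
p_value
```

Using the absolute value makes the test two-sided; for a one-sided test you would compare the signed differences instead.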
