Abstract
Here you will find all the bioinformatic analysis performed in this paper together with a description of each code chunk.This R code chunk is loading several libraries and defining a custom
color palette. It begins by loading various R libraries, including
scifer
, dplyr
, cluster
,
factoextra
, ggsci
, tidyr
,
microbenchmark
, ggpubr
, rstatix
,
and ggprism
. These libraries provide various functions and
tools for data manipulation, statistical analysis, and data
visualization. Additionally, the code defines a custom color palette
named color_palette
#Load libraries
# if you do not have libraries, they are located in either CRAN or Bioconductor
# libraries should be able to be automatically retrieved using renv::restore() function.
library(scifer)
library(dplyr)
library(cluster)
library(factoextra)
library(ggsci)
library(tidyr)
library(microbenchmark)
library(ggpubr)
library(rstatix)
library(ggprism)
color_palette <- c("1" = "#4DBBD5B2", "2" = "#DC0000B2", "High quality" = "#4DBBD5B2", "Low quality" = "#DC0000B2", "scifer" = "#4DBBD5B2", "unprocessed" = "#DC0000B2", "filtered_out" = "grey20", "phred30" = "grey90")
This R code chunk processes heavy and light chain sequencing data. It calculates quality summaries for sequences located in a specified folder and saves these summaries as a compressed RDS file named “hc_quality_table.rds” in a data directory, which should be defined elsewhere in the code.
file_location <- sub("/src", "",getwd())
hc_quality <- summarise_quality(folder_sequences = paste0(file_location,"/data/HC_raw_ab1_files"), processors = NULL)
hc_quality$summaries %>%
select(-file.path) %>%
saveRDS(file = paste0(data.dir, "/processed_data/hc_quality_table.rds"), compress = "gzip")
Previously processed data is loaded to avoid redundant computations. It reads in a processed quality table stored as an RDS file and removes any rows with missing values. Additionally, it loads data from two separate files, “scifer_assigned.tsv.gz” and “unprocessed_assigned.tsv.gz,” both in tab-separated format, likely containing alignment results or assignments.
# read rds data
hc_quality_table <- readRDS(here::here("data/processed_data/hc_quality_table.rds"))
# remove NAs
hc_quality_table <- na.omit(hc_quality_table)
# read IgBlast output from scifer filtered sequences
scifer_aligned <- read.table(here::here("data/processed_data/scifer_assigned.tsv.gz"), sep = "\t", header = TRUE)
# read IgBlast output from unfiltered sequences
unprocessed_aligned <- read.table(here::here("data/processed_data/unprocessed_assigned.tsv.gz"), sep = "\t", header = TRUE)
First the numbers of clusters were decided based on the Silhouette method. The average silhouette measures the quality of each clusters based on well each each sample fits in a given cluster. Higher this metric, better the clustering. As seen below, the optimal number of clusters based on this method was 2. This also facilitates the separation of “high quality” and “low quality” sequences between two clusters.
# scale variables prior clustering
scaled_table <- scale(hc_quality_table[,-c(1,2)])
# identify optimal number of clusters
fviz_nbclust(scaled_table[,-c(1,2)], kmeans, method = "silhouette") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12))
ggsave(path = result.dir, filename = "silhouette_optimal_clusters.pdf", height = 5, width = 5)
clusters_table <- kmeans(scaled_table[,-c(1,2)], centers = 2, nstart = 25)
hc_quality_cluster <- cbind(hc_quality_table, cluster = clusters_table$cluster)
This code chunk performs Principal Component Analysis (PCA) on the
data, with colors assigned based on k-means clustering. It begins by
processing and scaling the data for PCA, excluding specific columns.
Then, it generates a PCA scatterplot using the fviz_pca_ind
function, with data points colored and shaped according to clustering
information. The color palette is customized, and various themes are
applied to the plot’s appearance. Finally, the resulting plot is saved
as a PDF file with specified dimensions and file name.
# process and scale data for PCA
qual_param_pca <- hc_quality_cluster %>%
select(-c("folder.name", "file.name", "cluster")) %>%
prcomp(scale = TRUE)
fviz_pca_ind(qual_param_pca, geom = "point",
habillage=hc_quality_cluster$cluster) +
labs(color = "Clusters", shape = "Clusters", fill = "Clusters") +
scale_color_manual(values = color_palette) +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12))
ggsave(path = result.dir,filename = "pca_optimal_clusters.pdf", height = 5, width = 5)
This code section visualizes the contributions of variables to each Principal Component (PC). It starts by plotting the top 10 contributors to PC1, with custom styling and themes applied. The resulting plot is saved as “pca_dim1_contrib.pdf” Next, it does the same for PC2, generating a plot for the top contributors and saving it as “pca_dim2_contrib.pdf” with specified dimensions and file names.
# contributions of PC1
fviz_pca_contrib(qual_param_pca, axes = 1, top = 10, choice = "var", color = "black") +
labs(x = "") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12),
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
ggsave(path = result.dir, filename = "pca_dim1_contrib.pdf", height = 5, width = 5)
# contributions of PC2
fviz_pca_contrib(qual_param_pca, axes = 2, top = 10, choice = "var", color = "black") +
labs(x = "") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12),
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
ggsave(path = result.dir,filename = "pca_dim2_contrib.pdf", height = 5, width = 5)
Variable contributions for the entire Principal Component Analysis (PCA) are visualized. It creates a plot that displays the contributions of variables to all PCs.
fviz_pca_var(qual_param_pca) +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12))
ggsave(path = result.dir,filename = "pca_var_vectors_contrib.pdf", height = 5, width = 5)
The code calculates summary statistics for each quality parameter
within different clusters, specifically computing the mean and standard
deviation. These statistics are generated to establish filtering
criteria, the mean±3SD
for each cluster was taken into
account to capture at least 99% of all cluster members. However, this
code does not include explicit filtering actions; it only prepares
summary information for further decision-making regarding data
filtering.
# print summary information about each cluster parameters
qual_cluster_summ <- hc_quality_cluster %>%
as.data.frame() %>%
group_by(cluster) %>%
summarise(across(everything(), list(mean = mean, sd = sd))) %>%
select(-(2:5))
# print summary information about each cluster parameters
qual_cluster_summ %>%
summarise(cluster = cluster,
trimmed_length = trimmed.length_mean - 3*trimmed.length_sd,
length_lower = raw.length_mean - 3*raw.length_sd,
trim_start = trim.start_mean + 3*trim.start_sd,
trim_finish = trim.finish_mean - 3*trim.finish_sd,
phred_score = trimmed.mean.quality_mean - 3*trimmed.mean.quality_sd)
## # A tibble: 2 × 6
## cluster trimmed_length length_lower trim_start trim_finish phred_score
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 301. 343. 45.3 320. 33.5
## 2 2 -173. 196. 177. -121. -1.28
t first calculates and visualizes the “Length filter” based on the
raw sequence length. Sequences are categorized as “High quality” or “Low
quality” depending on whether they exceed a length threshold of
343
. The resulting histogram is colored according to a
custom palette, with a dashed vertical line at the threshold. The plot
is saved as “length_filter.pdf” with specified dimensions and file
name.
hc_quality_table %>%
mutate(filtered = ifelse(raw.length > 343, "High quality", "Low quality")) %>%
ggplot(aes(x = raw.length, groups = filtered, fill = filtered)) +
geom_histogram(binwidth = 10, color = "black") +
scale_fill_manual(values = color_palette) +
labs(x = "Raw sequence length", y = "Count", title = "Length") +
geom_vline(linetype = "dashed", color = "#4DBBD5B2", xintercept = 343) +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12))
ggsave(path = result.dir,filename = "length_filter.pdf", height = 5, width = 5)
This code section contains code for plotting the “Contiguous quality
filter.” It visualizes the distribution of trimming positions based on
quality criteria. Trimming positions are categorized as “High quality”
or “Low quality” based on specified thresholds (65
and
400
). The resulting histogram is colored according to a
custom palette and includes dashed vertical lines at the threshold
values. The plot is saved as “trimm_values_filter.pdf” with specified
dimensions and file name.
hc_quality_table %>%
pivot_longer(c(trim.start, trim.finish), names_to = "trimming", values_to = "trimm_values") %>%
mutate(filtered = ifelse(trimm_values < 65|trimm_values > 400, "High quality", "Low quality")) %>%
ggplot(aes(x = trimm_values, group = filtered, fill = filtered)) +
geom_histogram(binwidth = 15, color = "black") +
geom_vline(linetype = "dashed", color = "#4DBBD5B2", xintercept = 65) +
geom_vline(linetype = "dashed", color = "#4DBBD5B2", xintercept = 400) +
scale_fill_manual(values = color_palette) +
labs(x = "Sequence trimming positions", y = "Count", title = "Trimming positions") +
scale_y_continuous(expand = c(0,0), limits = c(0, 4200)) +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12))
ggsave(path = result.dir, filename = "trimm_values_filter.pdf", height = 5, width = 5)
This section creates a histogram displaying the distribution of Phred
quality scores after trimming, categorizing them as “High quality” or
“Low quality” based on a threshold of 30
. Dashed vertical
lines mark the threshold, and the plot is colored using a custom
palette. The resulting visualization is saved as
“phred_score_filter.pdf” with specified dimensions and file name.
hc_quality_table %>%
mutate(filtered = ifelse(trimmed.mean.quality > 30, "High quality", "Low quality")) %>%
ggplot(aes(x = trimmed.mean.quality, group = filtered, fill = filtered)) +
geom_histogram(color = "black") +
geom_vline(linetype = "dashed", color = "#4DBBD5B2", xintercept = 30) +
scale_fill_manual(values = color_palette) +
scale_y_continuous(expand = c(0,0), limits = c(0, 1000)) +
labs(x = "Phred score (after trimming)", y = "Count", title = "Basecall quality") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12))
ggsave(path = result.dir,filename = "phred_score_filter.pdf", height = 5, width = 5)
This section performs data processing related to HCDR3 regions. It
identifies secondary peaks within the sequencing data and filters them
based on position criteria. The resulting data is integrated with the
hc_quality_cluster
dataframe, and the combined data is
saved as “hc_quality_full_table.csv” in the specified directory. Here
the HCDR3 is defined by the user as within the position to
100
from 150
. However, to speed up the
analysis, this chunk was run once and the output was saved to be used on
the chunk below.
file_locations <- list.files(path = paste0(data.dir, "/HC_raw_ab1_files"),recursive = T, full.names = T)
sangerseqlisted <- sapply(file_locations, sangerseqR::readsangerseq)
sp <- lapply(sangerseqlisted, secondary_peaks, ratio = 0.33, processors = NULL)
df <- lapply(sp, function(x) x[["secondary.peaks"]])
df <- lapply(df, function(x) filter(x, position > 100 & position < 150))
df <- lapply(df, function(x) ifelse(nrow(x) > 0, nrow(x), 0))
df <- tibble(sec.peak.CDR3=as.numeric(as.character(df)), file.path=names(df))
hc_quality_cluster <-
df %>%
separate(file.path, into = c("_1", "_2","_3","folder.name", "file.name"), sep = "/") %>%
select(-c("_1", "_2","_3")) %>%
right_join(hc_quality_cluster %>% mutate(folder.name = gsub("_HC", "",folder.name)), by = c("folder.name", "file.name"))
write.csv(hc_quality_cluster, paste0(data.dir, "/processed_data/hc_quality_full_table.csv"),row.names = FALSE)
Processes and plots data related to the HCDR3 region from the output of the chunk above. It begins by reading a CSV file, making various data transformations, and filtering for sequences with the highest raw mean quality. Then, it creates a histogram plot of the secondary peaks (HCDR3) with a threshold line at 5. The plot is styled with custom themes, and the resulting visualization is saved as “cdr3_filter.pdf” with specified dimensions and file name.
hc_quality_cluster <- read.csv(paste0(data.dir, "/processed_data/hc_quality_full_table.csv")) %>%
mutate(plate = sub("_HC", "",folder.name),
well = gsub("\\-.*|\\_.*", "", file.name),
well_number = gsub("[a-zA-Z]", "", well),
well_letter = gsub("[[:digit:]]", "", well),
well = paste0(well_letter, sprintf("%02s", well_number)),
ID = gsub("\\-.*|\\_.*", "", folder.name),
sequence_id = sub("_R", "", paste(plate, well, sep = "_"))) %>%
select(-c(well_number, well_letter, plate, well, ID)) %>%
group_by(sequence_id) %>%
filter(raw.mean.quality == max(raw.mean.quality)) %>%
ungroup()
hc_quality_cluster %>%
mutate(filtered = ifelse(sec.peak.CDR3 <= 5, "High quality", "Low quality")) %>%
ggplot(aes(x = sec.peak.CDR3, group = filtered, fill = filtered)) +
geom_histogram(color = "black") +
geom_vline(linetype = "dashed", color = "#4DBBD5B2", xintercept = 5) +
scale_fill_manual(values = color_palette) +
scale_y_continuous(expand = c(0,0), limits = c(0, 3400)) +
labs(x = "Secondary peaks (HCDR3)", y = "Count", title = "HCDR3 quality") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12))
ggsave(path = result.dir,filename = "cdr3_filter.pdf", height = 5, width = 5)
Here a comparison is made between filtered scifer
and
non-filtered sequences. It starts by adding a “filter” column to the
hc_quality_cluster
dataframe based on specified quality
criteria. Then, it performs PCA on the filtered data, excluding certain
columns. The resulting PCA scatterplot is colored based on the “High
quality” or “Low quality” using all filters set up in the
previously.
hc_quality_cluster <- hc_quality_cluster %>%
mutate(filter = ifelse( trimmed.length >= 343 &
trim.finish >= 400 &
trimmed.mean.quality >= 30 &
trim.start <= 65 &
sec.peak.CDR3 <= 5,
"High quality", "Low quality"))
qual_final_pca <- hc_quality_cluster %>%
select(-c("folder.name", "file.name", "cluster", "filter", "sequence_id")) %>%
prcomp(scale = TRUE)
fviz_pca_ind(qual_final_pca, geom = "point",
habillage=hc_quality_cluster$filter) +
scale_color_manual(values = color_palette) +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12))
ggsave(path = result.dir,filename = "pca_final_filtered.pdf", height = 5, width = 5)
Here it was evaluated the processing speed of different sequence
counts under various parallelization settings. It first defines folders
for different sequence counts, then measures the processing time using
the microbenchmark
package, both in sequential and parallel
modes. The results are saved as “parallelization_comp.csv”. The first
code chunk was run only once and the output was saved for plotting to
speed up the analysis.
folder_1seq <- paste0(data.dir,"/HC_raw_ab1_speed/seq_1")
folder_10seq <- paste0(data.dir, "/HC_raw_ab1_speed/seq_10")
folder_100seq <- paste0(data.dir, "/HC_raw_ab1_speed/seq_100")
folder_1000seq <- paste0(data.dir, "/HC_raw_ab1_speed/seq_1000")
mbm <- microbenchmark(
sequence_1 = summarise_quality(folder_1seq, processors = 1),
sequence_10 = summarise_quality(folder_10seq, processors = 1),
sequence_100 = summarise_quality(folder_100seq, processors = 1),
sequence_1000 = summarise_quality(folder_1000seq, processors = 1),
sequence_1_parallel = summarise_quality(folder_1seq, processors = 8),
sequence_10_parallel = summarise_quality(folder_10seq, processors = 8),
sequence_100_parallel = summarise_quality(folder_100seq, processors = 8),
sequence_1000_parallel = summarise_quality(folder_1000seq, processors = 8),
times = 2
)
nanosec_to_sec <- function(nanoseconds) {
seconds <- nanoseconds / 1e9
return(seconds)
}
mbm <- mbm %>%
separate(expr, into = c("x", "number_seqs", "parallelization")) %>%
mutate(processing = ifelse(is.na(parallelization), "sequential", "parallel"),
time_seconds = nanosec_to_sec(time)) %>%
select(-c(parallelization,x))
write.csv(mbm, paste0(data.dir, "/processed_data/parallelization_comp.csv"),row.names = FALSE)
This code section reads the previously saved data and generates a plot comparing processing time for different sequence counts and parallelization settings. The plot shows mean processing times, error bars, and scaling of the x-axis.
mbm <- read.csv(paste0(data.dir,"/processed_data/parallelization_comp.csv"))
mbm %>%
group_by(processing, number_seqs) %>%
summarise(mean_time = mean(time_seconds),
sd_time = sd(time_seconds)) %>%
ungroup() %>%
ggplot(aes(x = number_seqs, y = mean_time, group = processing, color = processing)) +
geom_point() +
geom_line() +
scale_x_log10() +
geom_errorbar(aes(ymin=mean_time-sd_time, ymax=mean_time+sd_time), width=.2) +
theme(legend.position = "top") +
labs(x = "Sequence counts", y = "Time (seconds)", groups = "Processing", color = "Processing", title = "Processing speed") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12)) +
scale_color_d3()
ggsave(path = result.dir, filename = "processing_speed.pdf", height = 5, width = 5)
This code portion begings by reading and processing AB1 files,
converting them to FASTA format while filtering out repeated and
low-quality sequences. These processed sequences are saved as
“unprocessed_sequences.fasta”. The quality_report()
wrapper
function is used on a set of sequences, with both flow cytometry index
sort data and Sanger sequences. This function assess sequence quality
and merge it with flow cytometry data, generating an HTML report named
“QC-report.html” and saving it to a specified output directory.
# read ab1 files and save fasta withour processing
list_ab1 <- list.files(paste0(data.dir,"/HC_raw_ab1_files"), full.names = TRUE, recursive = TRUE)
names_list <- lapply(list_ab1, function(x){
name <- strsplit(x, split = "/")
name_plate <- name[[1]][[4]]
well_row <- stringr::str_match(name[[1]][[5]], "^(.)[0-9]+-")[, 2]
well_column <- sprintf("%02d",as.numeric(stringr::str_match(name[[1]][[5]], "([0-9]+)-")[, 2]))
name <- paste(name_plate, paste0(well_row, well_column), collapse = "", sep = "_")
return(name)
})
list_sangerseq <- parallel::mclapply(list_ab1, sangerseqR::readsangerseq, mc.cores = 7)
list_sangerseq <- parallel::mclapply(list_sangerseq, sangerseqR::primarySeq, mc.cores = 7)
names(list_sangerseq) <- unlist(names_list)
unprocessed_fasta <- Biostrings::BStringSet(list_sangerseq)
# remove repeated sequences and unknown specificity
unprocessed_fasta <- unprocessed_fasta[!grepl("_R_", names(unprocessed_fasta))]
unprocessed_fasta <- unprocessed_fasta[!grepl("_NA", names(unprocessed_fasta))]
Biostrings::writeXStringSet(unprocessed_fasta, paste0(data.dir,"/unprocessed/unprocessed_sequences.fasta"))
# run quality_report function to filter High quality and merge with flow data
quality_report(
folder_sequences = "~/OneDrive - Karolinska Institutet/projects/r-package/scifer_manuscript/data/HC_raw_ab1_files/",
outputfile = "QC-report.html",
output_dir = "~/OneDrive - Karolinska Institutet/projects/r-package/scifer_manuscript/results/2023-07-20/",
folder_path_fcs = "~/OneDrive - Karolinska Institutet/projects/r-package/scifer_manuscript/data/flow_cytometry/KI" , compensation = TRUE, plate_wells = "96",
probe1 = "Pre.F", probe2 = "Post.F",
posvalue_probe1 = 600, posvalue_probe2 = 400,
cdr3_start = 100,
cdr3_end = 150
)
IgBlast was used within the IgDiscover software (v. 0.15). Below is an example bash code to demonstrate how it was run. To avoid time-consuming IgBlast gene assignment/alignment, we have saved the results and used as input. The processed sequences are loaded in the Load processed data section.
# enter correct results folder generated within R
current_date=$(date "+%Y-%m-%d")
cd results/$current_date
# run igdiscover init for unprocessed sequences
igdiscover init --database data/database/KIMDB_rm --single-reads ../../data/unprocessed/unprocessed_sequences.fasta unprocessed_seqs_igblast
# run igdiscover/IgBlast
cd unprocessed_seqs_igblast && igdiscover run
# run igdiscover init for scifer-filtered sequences
igdiscover init --database data/database/KIMDB_rm --single-reads ../../data/scifer_filtered_sequences.fasta scifer_seqs_igblast
# run igdiscover/IgBlast
cd scifer_seqs_igblast && igdiscover run
The alignment quality of sequencing data is compared across different preprocessing methods, including “scifer,” “phred30,” “unprocessed,” and “filtered_out.” The alignment scores for genes V, D, and J are analyzed using one-way ANOVA and post-hoc Tukey tests. Cohen’s effect size is also calculated, and the results are visualized with violin plots.
merged_datasets <- data.table::rbindlist(list(scifer = scifer_aligned,
phred30 = unprocessed_aligned[unprocessed_aligned$sequence_id %in% (hc_quality_cluster %>% filter(trimmed.mean.quality > 30) %>% pull(sequence_id)),],
unprocessed = unprocessed_aligned,
filtered_out = unprocessed_aligned[!(unprocessed_aligned$sequence_id %in% sub("_[^_]*$", "", scifer_aligned$sequence_id)),]),
idcol = TRUE) %>%
rename(preprocessing = .id) %>%
pivot_longer(cols = c("v_score", "j_score", "d_score"),
names_to = "gene_score",
values_to = "alignment_score") %>%
mutate(gene_score = plyr::mapvalues(gene_score,
from = c("v_score", "d_score", "j_score"),
to = c("HV gene", "HD gene", "HJ gene")),
gene_score = factor(gene_score, levels = c("HV gene", "HD gene", "HJ gene")),
preprocessing = factor(preprocessing, levels = c("scifer", "phred30","unprocessed", "filtered_out")))
# One way-anova to check if there is a difference between groups
anova <- merged_datasets %>%
group_by(gene_score) %>%
anova_test(formula = alignment_score ~ preprocessing)
anova
## # A tibble: 3 × 8
## gene_score Effect DFn DFd F p `p<.05` ges
## * <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 HV gene preprocessing 3 13957 2481. 0 * 0.348
## 2 HD gene preprocessing 3 13547 166. 7.85e-106 * 0.036
## 3 HJ gene preprocessing 3 13735 1792. 0 * 0.281
# The two-tailed post-hoc Tukey test with an FDR correction is run
stat.test <- merged_datasets %>%
group_by(gene_score) %>%
tukey_hsd(formula = alignment_score ~ preprocessing,
p.adjust.method = "fdr") %>%
add_xy_position(x = "preprocessing") %>%
mutate(p.adj = ifelse(p.adj <= 0.00001, "< 0.0001", round(p.adj, 4)))
# The effect size was calculated using Cohen's method
stat.effectsize <- merged_datasets %>%
group_by(gene_score) %>%
cohens_d(formula = alignment_score ~ preprocessing)
# Staltistical results are merged in a single object
stats <- left_join(stat.test, stat.effectsize,
by = c("group1","group2","gene_score")) %>%
mutate(effsize = paste0(p.adj.signif,", ","d = ",abs(round(effsize, 2))))
merged_datasets %>%
ggpubr::ggviolin(y = "alignment_score", x = "preprocessing", facet.by = "gene_score",legend = "none", fill = "preprocessing") +
geom_boxplot(aes(fill = preprocessing),outlier.shape = NA, width = 0.10, color = "black", alpha = 0.2) +
stat_pvalue_manual(stats, label = "effsize", step.increase = .01) +
scale_fill_manual(values = color_palette) +
scale_y_continuous(limits = c(0,1000) , expand = c(0,0)) +
labs(x = "", y = "Alignment score") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
legend.position = "none")
ggsave(path = result.dir,filename = "alignment_score_phred30.pdf", height = 5, width = 5)
The alignment score comparison, focusing on the “scifer” and “unprocessed” preprocessing methods for genes V, D, and J. Two-tailed Student’s t-tests with FDR correction are performed, and Cohen’s effect size is calculated. The results are visualized using violin plots.
merged_datasets <- data.table::rbindlist(list(scifer = scifer_aligned,
unprocessed = unprocessed_aligned),
idcol = TRUE) %>%
rename(preprocessing = .id) %>%
pivot_longer(cols = c("v_score", "j_score", "d_score"),
names_to = "gene_score",
values_to = "alignment_score") %>%
mutate(gene_score = plyr::mapvalues(gene_score,
from = c("v_score", "d_score", "j_score"),
to = c("HV gene", "HD gene", "HJ gene")),
gene_score = factor(gene_score, levels = c("HV gene", "HD gene", "HJ gene")),
comparator_score = paste(preprocessing, gene_score, sep = "_"))
# set comparisons
my_comparisons <- list(c("scifer", "unprocessed"))
my_comparisons <- c(lapply(my_comparisons, paste0, "_HV gene"),
lapply(my_comparisons, paste0, "_HD gene"),
lapply(my_comparisons, paste0, "_HJ gene"))
# The two-tailed Student's t-test with an FDR correction is run
stat.test <- merged_datasets %>%
group_by(gene_score) %>%
t_test(formula = alignment_score ~ preprocessing,
paired = FALSE,
comparisons = my_comparisons,
p.adjust.method = "fdr") %>%
add_xy_position(x = "preprocessing") %>%
mutate(p = ifelse(p <= 0.00001, "< 0.0001",p))
# The effect size was calculated using Cohen's method
stat.effectsize <- merged_datasets %>%
group_by(gene_score) %>%
cohens_d(formula = alignment_score ~ preprocessing,
comparisons = my_comparisons,
paired = FALSE)
# Staltistical results are merged in a single object
stats <- left_join(stat.test, stat.effectsize,
by = c("group1","group2", ".y.", "n1", "n2", "gene_score")) %>%
mutate(effsize = paste0("p ",p,", ","d = ",round(effsize, 2)))
merged_datasets %>%
ggpubr::ggviolin(y = "alignment_score", x = "preprocessing", facet.by = "gene_score",legend = "none", fill = "preprocessing") +
geom_boxplot(aes(fill = preprocessing),outlier.shape = NA, width = 0.10, color = "black", alpha = 0.2) +
stat_pvalue_manual(stats, label = "effsize", y.position = 550) +
scale_fill_manual(values = color_palette) +
scale_y_continuous(limits = c(0,600) , expand = c(0,0)) +
labs(x = "", y = "Alignment score") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
legend.position = "none")
ggsave(path = result.dir, filename = "alignment_score_phred.pdf", height = 5, width = 5)
This section compares gene coverage for the “scifer” and “unprocessed” preprocessing methods. It calculates coverage for genes V, D, and J. Two-tailed Student’s t-test FDR correction is performed and Cohen’s effect size is calculated. The results are presented using violin plots.
merged_datasets <- data.table::rbindlist(list(scifer = scifer_aligned,
unprocessed = unprocessed_aligned), idcol = TRUE) %>%
rename(preprocessing = .id) %>%
pivot_longer(cols = c("V_covered", "J_covered", "D_covered"),
names_to = "gene_covered",
values_to = "alignment_covered") %>%
mutate(gene_covered = plyr::mapvalues(gene_covered,
from = c("V_covered", "D_covered", "J_covered"),
to = c("HV gene", "HD gene", "HJ gene")),
gene_covered = factor(gene_covered, levels = c("HV gene", "HD gene", "HJ gene")),
comparator_covered = paste(preprocessing, gene_covered, sep = "_"))
# set comparisons
my_comparisons <- list(c("scifer", "unprocessed"))
my_comparisons <- c(lapply(my_comparisons, paste0, "_HV gene"),
lapply(my_comparisons, paste0, "_HD gene"),
lapply(my_comparisons, paste0, "_HJ gene"))
# The two-tailed Student's t-test with an FDR correction is run
stat.test <- merged_datasets %>%
group_by(gene_covered) %>%
t_test(formula = alignment_covered ~ preprocessing,
paired = FALSE,
comparisons = my_comparisons,
p.adjust.method = "fdr") %>%
add_xy_position(x = "preprocessing") %>%
mutate(p = ifelse(p <= 0.00001, "< 0.0001",p))
# The effect size was calculated using Cohen's method
stat.effectsize <- merged_datasets %>%
group_by(gene_covered) %>%
cohens_d(formula = alignment_covered ~ preprocessing,
comparisons = my_comparisons,
paired = FALSE)
# Staltistical results are merged in a single object
stats <- left_join(stat.test, stat.effectsize,
by = c("group1","group2", ".y.", "n1", "n2", "gene_covered")) %>%
mutate(effsize = paste0("p ",p,", ","d = ",round(effsize, 2)))
merged_datasets %>%
ggpubr::ggviolin(y = "alignment_covered", x = "preprocessing", facet.by = "gene_covered",legend = "none", fill = "preprocessing") +
geom_boxplot(aes(fill = preprocessing),outlier.shape = NA, width = 0.10, color = "black", alpha = 0.2) +
stat_pvalue_manual(stats, label = "effsize", y.position = 115) +
scale_fill_manual(values = color_palette) +
scale_y_continuous(limits = c(0,130) , expand = c(0,0), breaks = seq(0, 100,25)) +
labs(x = "", y = "Alignment coverage") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
legend.position = "none")
ggsave(path = result.dir,filename = "gene_coverage.pdf", height = 5, width = 5)
This code chunk analyzes somatic hypermutation (SHM) in genes V for the “scifer” and “unprocessed” preprocessing methods. Two-tailed Student’s t-test is performed and Cohen’s effect size is calculated. The results are visualized with violin plots.
merged_datasets <- data.table::rbindlist(list(scifer = scifer_aligned,
unprocessed = unprocessed_aligned), idcol = TRUE) %>%
rename(preprocessing = .id)
# set comparisons
my_comparisons <- list(c("scifer", "unprocessed"))
# The two-tailed Student's t-test with an FDR correction is run
stat.test <- merged_datasets %>%
t_test(formula = V_SHM ~ preprocessing,
paired = FALSE,
comparisons = my_comparisons,
p.adjust.method = "fdr") %>%
add_xy_position(x = "preprocessing") %>%
mutate(p = ifelse(p <= 0.00001, "< 0.0001",p))
# The effect size was calculated using Cohen's method
stat.effectsize <- merged_datasets %>%
cohens_d(formula = V_SHM ~ preprocessing,
comparisons = my_comparisons,
paired = FALSE)
# Staltistical results are merged in a single object
stats <- left_join(stat.test, stat.effectsize,
by = c("group1","group2", ".y.", "n1", "n2")) %>%
mutate(effsize = paste0("p ",p,", ","d = ",abs(round(effsize, 2))))
merged_datasets %>%
ggpubr::ggviolin(y = "V_SHM", x = "preprocessing",
legend = "none",
fill = "preprocessing") +
geom_boxplot(aes(fill = preprocessing),outlier.shape = NA, width = 0.10, color = "black", alpha = 0.2) +
stat_pvalue_manual(stats, label = "effsize", y.position = 55) +
scale_fill_manual(values = color_palette) +
scale_y_continuous(limits = c(0,60) , expand = c(0,0)) +
labs(x = "", y = "HV gene mutation (%)") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
ggsave(path = result.dir,filename = "HV_gene_mutation.pdf", height = 5, width = 5)
This section explores the correlation between somatic hypermutation and alignment quality for different preprocessing methods and gene segments (HV, HD, HJ). It calculates Pearson correlation coefficients and presents the results using facet-wrapped scatter plots.
merged_datasets <- data.table::rbindlist(list(scifer = scifer_aligned,
unprocessed = unprocessed_aligned), idcol = "preprocessing") %>%
pivot_longer(cols = c("v_score", "j_score", "d_score"),
names_to = "gene_score",
values_to = "alignment_score") %>%
pivot_longer(cols = c("V_errors", "D_errors", "J_errors"),
names_to = "gene_mutations",
values_to = "number_mutations") %>%
mutate(gene_score = plyr::mapvalues(gene_score,
from = c("v_score", "d_score", "j_score"),
to = c("HV gene", "HD gene", "HJ gene")),
gene_score = factor(gene_score, levels = c("HV gene", "HD gene", "HJ gene")),
gene_mutations = plyr::mapvalues(gene_mutations,
from = c("V_errors", "D_errors", "J_errors"),
to = c("HV mutations", "HD mutations", "HJ mutations")),
gene_mutations = factor(gene_mutations, levels = c("HV mutations", "HD mutations", "HJ mutations")))
# Define the combinations to keep
combinations_to_keep <- data.frame(
preprocessing = c("scifer", "unprocessed"),
gene_score = c("HV gene", "HV gene", "HJ gene", "HJ gene", "HD gene", "HD gene"),
gene_mutations = c("HV mutations", "HV mutations", "HJ mutations", "HJ mutations", "HD mutations", "HD mutations")
)
# Join the combinations to keep with the original dataset
filtered_datasets <- merge(combinations_to_keep, merged_datasets, by = c("preprocessing", "gene_score", "gene_mutations"), all.x = TRUE) %>%
mutate(
perc_mutations = case_when(
gene_score == "HV gene" ~ (number_mutations/(v_sequence_end-v_sequence_start))*100,
gene_score == "HD gene" ~ (number_mutations/(d_sequence_end-d_sequence_start))*100,
gene_score == "HJ gene" ~ (number_mutations/(j_sequence_end-j_sequence_start))*100))
filtered_datasets$gene_score <- factor(filtered_datasets$gene_score,
levels = c("HV gene", "HD gene", "HJ gene"))
# Create the grid of plots
plots <- filtered_datasets %>%
ggplot(aes(x = perc_mutations, y = alignment_score)) +
geom_point(color = "grey50", alpha = .1) +
geom_smooth(aes(color = preprocessing), method = "lm") +
stat_cor(p.accuracy = 0.0001, r.accuracy = 0.01, label.x = 20, method = "pearson", label.sep = "\n", cor.coef.name = "r") +
scale_y_continuous(expand = c(0,0), limits = c(0, 500)) +
scale_x_continuous(expand = c(0,0)) +
scale_color_manual(values = color_palette) +
facet_wrap(~ preprocessing + gene_score + gene_mutations, nrow = 2) +
labs(x = "Gene mutations (%)", y = "Alignment score") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12),
legend.position = "none")
plots
ggsave(path = result.dir, filename = "correlation_alignment_mutations.pdf", height = 5, width = 12)
This code chunk examines the correlation between somatic hypermutation in the HV gene and Phred scores for the “scifer” and “unprocessed” preprocessing methods. It calculates Pearson correlation coefficients and visualizes the results with scatter plots.
merged_datasets <- data.table::rbindlist(list(scifer = scifer_aligned,
unprocessed = unprocessed_aligned), idcol = "preprocessing") %>%
pivot_longer(cols = c("V_errors", "D_errors", "J_errors"),
names_to = "gene_mutations",
values_to = "number_mutations") %>%
mutate(gene_mutations = plyr::mapvalues(gene_mutations,
from = c("V_errors", "D_errors", "J_errors"),
to = c("HV mutations", "HD mutations", "HJ mutations")),
gene_mutations = factor(gene_mutations, levels = c("HV mutations", "HD mutations", "HJ mutations")),
comparator_score = paste(preprocessing,"number_mutations", sep = "_"))
cor_phred_score <- hc_quality_cluster %>%
right_join(merged_datasets %>% mutate(sequence_id = gsub("_PreF|_PostF|_NA", "", sequence_id)), by = "sequence_id")
cor_phred_score %>%
filter(gene_mutations == "HV mutations") %>%
ggplot(aes(x = number_mutations, y = raw.mean.quality)) +
geom_point(color = "grey50", alpha = .1) +
geom_smooth(aes(color = preprocessing), method = "lm") +
stat_cor(p.accuracy = 0.0001, r.accuracy = 0.01, label.x = 60, method = "pearson", label.sep = "\n", cor.coef.name = "r") +
scale_y_continuous(expand = c(0,0)) +
scale_x_continuous(expand = c(0,0)) +
scale_color_manual(values = color_palette) +
facet_wrap(~ preprocessing + gene_mutations, nrow = 1) +
labs(x = "HV gene mutations (#)", y = "Mean Phred score") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(linewidth = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12),
legend.position = "none")
ggsave(path = result.dir, filename = "correlation_alignment_HVmutation.pdf", height = 5, width = 5)
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS/LAPACK: /Users/rodrigoarcoverde/opt/anaconda3/envs/scifer_analysis/lib/libopenblasp-r0.3.24.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggprism_1.0.4 rstatix_0.7.2 ggpubr_0.6.0
## [4] microbenchmark_1.4.10 tidyr_1.3.0 ggsci_3.0.0
## [7] factoextra_1.0.7 ggplot2_3.4.3 cluster_2.1.4
## [10] dplyr_1.1.3 scifer_1.0.0
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-163 bitops_1.0-7 matrixStats_1.0.0
## [4] bit64_4.0.5 webshot_0.5.5 httr_1.4.7
## [7] rprojroot_2.0.3 GenomeInfoDb_1.34.9 backports_1.4.1
## [10] tools_4.2.2 bslib_0.5.1 utf8_1.2.3
## [13] R6_2.5.1 mgcv_1.9-0 DBI_1.1.3
## [16] BiocGenerics_0.44.0 colorspace_2.1-0 withr_2.5.1
## [19] tidyselect_1.2.0 gridExtra_2.3 bit_4.0.5
## [22] compiler_4.2.2 cli_3.6.1 rvest_1.0.3
## [25] Biobase_2.58.0 xml2_1.3.3 labeling_0.4.3
## [28] sass_0.4.7 scales_1.2.1 systemfonts_1.0.4
## [31] stringr_1.5.0 digest_0.6.33 rmarkdown_2.25
## [34] sangerseqR_1.34.0 svglite_2.1.1 XVector_0.38.0
## [37] pkgconfig_2.0.3 htmltools_0.5.6 fastmap_1.1.1
## [40] rlang_1.1.1 rstudioapi_0.15.0 flowCore_2.10.0
## [43] RSQLite_2.3.1 shiny_1.7.5 farver_2.1.1
## [46] jquerylib_0.1.4 generics_0.1.3 jsonlite_1.8.7
## [49] car_3.1-2 RCurl_1.98-1.12 magrittr_2.0.3
## [52] kableExtra_1.3.4 GenomeInfoDbData_1.2.9 Matrix_1.6-1.1
## [55] RProtoBufLib_2.10.0 Rcpp_1.0.11 munsell_0.5.0
## [58] S4Vectors_0.36.0 fansi_1.0.4 abind_1.4-5
## [61] DECIPHER_2.26.0 lifecycle_1.0.3 stringi_1.7.12
## [64] yaml_2.3.7 carData_3.0-5 zlibbioc_1.44.0
## [67] plyr_1.8.9 grid_4.2.2 blob_1.2.4
## [70] parallel_4.2.2 promises_1.2.1 ggrepel_0.9.3
## [73] crayon_1.5.2 lattice_0.21-9 splines_4.2.2
## [76] Biostrings_2.66.0 knitr_1.44 pillar_1.9.0
## [79] ggsignif_0.6.4 stats4_4.2.2 glue_1.6.2
## [82] evaluate_0.22 data.table_1.14.8 vctrs_0.6.3
## [85] httpuv_1.6.11 gtable_0.3.4 purrr_1.0.2
## [88] cachem_1.0.8 xfun_0.40 mime_0.12
## [91] broom_1.0.5 xtable_1.8-4 later_1.3.1
## [94] viridisLite_0.4.2 tibble_3.2.1 cytolib_2.10.0
## [97] memoise_2.0.1 IRanges_2.32.0 ellipsis_0.3.2
## [100] here_1.0.1