Abstract
Here you can find how the processing and plotting related to our B cell receptor clonotype analysis were performed for the paper entitled “Multivalent Antigen Display on Nanoparticles Elicits Diverse and Broad B cell Responses”. This Rmarkdown provides the analysis of the processed data in that paper. It includes the B cell receptor analysis, the plots, and the generation of intermediate files for plotting using other programs. The processed files used here are AIRR-compliant datasets that can be downloaded from Zenodo (7895251); they are either Human Respiratory Syncytial Virus-specific (RSV-specific) or the full bulk repertoire from the vaccinated non-human primates (Macaca mulatta).
library(jsonlite)
library(tidyr)
library(treeio)
library(ggtree)
library(rstatix)
library(ggpubr)
library(Biostrings)
library(data.table)
library(vegan)
library(ggplot2)
library(scales)
library(ggprism)
library(dplyr)
library(kableExtra)
source("df_to_fasta.R")
set.seed(123)
The data used here is stored in .rds format or as a table (.csv or .tsv) and can be found in the Zenodo repository (7895251).
data_comp <- read.csv("../data/ELISA_comp/2023-03-06_normalized_plasma_compt.csv") %>%
mutate(group = case_when(group == "NP 100%" ~ "20-mer",
group == "NP 50%" ~ "10-mer",
group == "Sol" ~ "1-mer",
group == "PostF" ~ "PostF"))
clonotype_rsv <- readRDS("../data/clonotypes/individualized/rsv-specific_clonotypes.rds")
# replace column names for AIRR-compliant names and filter out "PV" timepoint which was not used for the analysis
clonotype_rsv <- clonotype_rsv %>%
filter(timepoint != "PV") %>%
mutate(cdr3_aa_length = nchar(CDR3_aa),
cdr3_aa = CDR3_aa,
v_call = V_gene,
j_call = J_gene,
d_call = D_gene)
# set color for fill and color aes layers
fill_col_values <- c("20-mer" = "#5F90B0", "10-mer" = "#92CDD5", "1-mer" = "#D896C1", "PostF" = "grey50")
color_values <- c("20-mer" = "#2E6997", "10-mer" = "#469698", "1-mer" = "#BE3C8C", "PostF" = "grey20")
# load repertoire sequencing reads info
rep_seq_ls <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = TRUE)
rep_seq_names <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = FALSE)
names(rep_seq_ls) <- sub("_st","",gsub("/","_",substr(rep_seq_names, 1,10)))
# load mAbs data
lor_mabs <- read.csv("../data/specificity/LOR_single-cells_characterized.csv")
mabs_lors <- read.csv("../data/single_cell/sc_summary_filtered.csv")
# load competition ELISA data
# edited the raw values to have the max value as the reference competition for ADI, MPE8, 101F, D25
data_comp_auc <- read.csv("../data/ELISA_comp/2023-04-28_LOR_norm-dAUC.csv", header = T, row.names = 1, encoding="UTF-8") %>%
replace(is.na(.), 0) %>%
mutate(specificity = factor(specificity, levels = c("PreF", "PreF/PostF", "PostF", "w.b.")),
epitope = factor(epitope, levels = c("Ø","V", "IV","III", "II","I", "foldon", "UK-Pre","UK-DP","PostF", "WB")))
# load light chains information clonotyped
clono_light_chains <- read.table("../data/clonotypes/light_chain_assigned_clonotypes.tsv", sep = "\t", header = TRUE)
The non-human primate plasma samples from this study were used for antibody competition ELISAs, with plates coated with pre-fusion or post-fusion protein, in competition against previously characterized monoclonal antibodies. Here we show these data using different visualization methods.
Multidimensional scaling (MDS) aims to preserve the distances between datapoints and/or samples across a set of variables. Thus, in this case, the closer two points are to each other, the more similar their competition profiles. It is a good way to summarize a large number of variables in a 2D space. The MDS input is a dissimilarity matrix; for this plot, the input matrix is based on the cosine distance. Although the euclidean distance is the most commonly used, we decided to use the cosine distance here because the magnitude of the responses is less important to us than the competition profile itself. Some NHPs were very strong responders, so the euclidean distance would place them far away from all the other NHPs because of their high antibody titers. Since the cosine distance measures the angle between profiles rather than the values themselves, it is not influenced by the weight or magnitude of the antibody titers.
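As a small illustration of this choice (toy numbers, not data from the study), two animals with proportional competition profiles have a cosine distance of essentially zero even though their euclidean distance is large:
a <- c(10, 5, 1) # toy competition profile
b <- 10 * a # same profile shape, 10-fold higher magnitude
1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))) # cosine distance, ~0
dist(rbind(a, b)) # euclidean distance, large and driven by magnitude alone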
cosine_distance <- function(x) {
#For cosine similarity matrix
Matrix <- x %>%
select(-c(timepoints, ID, group)) %>%
as.matrix()
sim <- Matrix / sqrt(rowSums(Matrix * Matrix))
sim <- sim %*% t(sim)
#Convert to cosine dissimilarity matrix (distance matrix).
D_sim <- as.dist(1 - sim)
mds <- D_sim %>%
cmdscale(3) %>%
as_tibble()
colnames(mds) <- c("Dim.1", "Dim.2", "Dim.3")
mds <- mds %>%
mutate(group = x$group,
timepoints = x$timepoints,
ID = x$ID)
return(mds)
}
The code below plots 4 different MDS plots. They are divided between 2 weeks after Boost 1 (B1) and Boost 2 (B2), with or without the animals vaccinated with the PostF immunogen. The top of each plot indicates B1 or B2, and the legends state whether the PostF animals are included. The cosine distance was calculated separately for each plot using the custom cosine_distance() function provided above; for that reason, the position of the points should not be compared directly between plots but only between points within the same plot.
# calculate cosine distance and generate MDS with and without PostF
mds_b1 <- data_comp %>% filter(timepoints == "B1") %>% cosine_distance()
mds_b2 <- data_comp %>% filter(timepoints == "B2") %>% cosine_distance()
mds_no_postf_b1 <- data_comp %>% filter(group != "PostF", timepoints == "B1") %>% cosine_distance()
mds_no_postf_b2 <- data_comp %>% filter(group != "PostF", timepoints == "B2") %>% cosine_distance()
ls_mds <- list(mds_b1, mds_b2, mds_no_postf_b1, mds_no_postf_b2)
plot_list <- list()
for(f in seq_along(ls_mds)){
# Plot and color by groups
mds_plot <- ls_mds[[f]]
if(f %in% c(2,4)){
mds_plot$Dim.1 <- mds_plot$Dim.1 * -1 # flip axis in first dimension
}
plot <- mds_plot %>%
ggplot(aes(Dim.1, Dim.2, color = group, fill = group)) +
geom_point(size = 4, shape = 21) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
lims(x = c(-.5,.5), y = c(-.4, .4)) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5),
legend.position = c(.2,.2),
legend.background = element_blank(),
legend.box.background = element_rect(colour = "black")) +
facet_wrap(~timepoints)
plot_list[[f]] <- plot
ggsave(plot = plot, filename = paste0("../", result.dir, f,"_mds_cosine-distance.pdf"), device = "pdf", width = 4, height = 4)
}
ggarrange(plot_list[[3]], plot_list[[4]], plot_list[[1]], plot_list[[2]], ncol = 2, nrow = 2, common.legend = TRUE,legend = "top", legend.grob = get_legend(plot_list[[2]] + theme(legend.position = "top")))
The code below generates bar plots to visualize the competition titers for epitopes on the PreF or PostF RSV antigen. The code starts by processing the data and categorizing the epitopes based on the competing monoclonal antibodies. It then defines the pairwise comparisons for statistical testing between groups at each timepoint, with fdr correction. Afterward, it performs a Wilcoxon rank-sum test and displays the results in a tabular format. Finally, the code creates a series of bar plots for each epitope, showing the competition titers at different timepoints, with individual animals represented as colored points and the mean values indicated by crossbars. The plots are faceted by epitope, providing a visualization of the competition titer dynamics for the different epitopes on the PreF antigen.
The code focused solely on the statistical tests, analysis, and p-value correction can be found in rsv_statistical_analysis.Rmd or rendered as a website here.
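For reference, the statistical testing pattern used throughout this document relies on rstatix::wilcox_test() with explicit pairwise comparisons and fdr correction. A minimal sketch of that pattern on the built-in ToothGrowth dataset (not study data) looks like this:
ToothGrowth %>%
mutate(dose = factor(dose)) %>%
wilcox_test(formula = len ~ dose,
comparisons = list(c("0.5", "1"), c("0.5", "2"), c("1", "2")),
p.adjust.method = "fdr",
paired = FALSE)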
data_comp_longer <- data_comp %>%
pivot_longer(cols = 2:9, names_to = "mAb", values_to = "ELISA_competition") %>%
mutate(epitopes = plyr::mapvalues(mAb,
from = c("D25.PreF", "MPE8.PreF", "ADI.PreF", "Pali.PreF", "X101F.PreF",
"ADI.PostF", "X101F.PostF", "Pali.PostF"),
to = c("Ø", "III", "I", "II", "IV",
"I", "IV", "II")),
epitopes = factor(epitopes, levels = c("Ø", "III", "IV", "II", "I")),
conformation = factor(case_when(grepl("PostF", mAb) ~ "PostF",
grepl("PreF", mAb) ~ "PreF"), levels = c("PreF", "PostF")),
timepoint_group = paste(timepoints, group, sep = "_"),
timepoint_group_epitope = paste(timepoints, group, epitopes, sep = "_"))
my_comparisons_sites <- list(c("B1_1-mer", "B1_20-mer"),
c("B1_1-mer", "B1_10-mer"),
c("B1_1-mer", "B1_PostF"),
c("B1_10-mer", "B1_20-mer"),
c("B1_10-mer", "B1_PostF"),
c("B1_20-mer", "B1_PostF"),
c("B2_1-mer", "B2_20-mer"),
c("B2_1-mer", "B2_10-mer"),
c("B2_1-mer", "B2_PostF"),
c("B2_10-mer", "B2_20-mer"),
c("B2_10-mer", "B2_PostF"),
c("B2_20-mer", "B2_PostF"))
my_comparisons_1 <- lapply(my_comparisons_sites, paste0, "_I")
my_comparisons_2 <- lapply(my_comparisons_sites, paste0, "_II")
my_comparisons_3 <- lapply(my_comparisons_sites, paste0, "_III")
my_comparisons_4 <- lapply(my_comparisons_sites, paste0, "_IV")
my_comparisons_5 <- lapply(my_comparisons_sites, paste0, "_Ø")
# Selecting sites for statistical comparison only between groups for each timepoint
# removed statistical comparisons for Boost 1 site I in PreF since most values were below the threshold of detection
my_comparisons_preF <- c(my_comparisons_1[7:12], my_comparisons_2, my_comparisons_3, my_comparisons_4, my_comparisons_5)
my_comparisons_postF <- c(my_comparisons_1, my_comparisons_2, my_comparisons_4)
stat.test <- data_comp_longer %>%
filter(conformation == "PreF") %>%
wilcox_test(formula = ELISA_competition ~ timepoint_group_epitope,
paired = FALSE,
p.adjust.method = "fdr",
comparisons = my_comparisons_preF)
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
.y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif |
---|---|---|---|---|---|---|---|---|
ELISA_competition | B2_1-mer_I | B2_20-mer_I | 4 | 5 | 14.0 | 0.413 | 0.812 | ns |
ELISA_competition | B2_1-mer_I | B2_10-mer_I | 4 | 5 | 20.0 | 0.020 | 0.176 | ns |
ELISA_competition | B2_1-mer_I | B2_PostF_I | 4 | 4 | 0.0 | 0.029 | 0.190 | ns |
ELISA_competition | B2_10-mer_I | B2_20-mer_I | 5 | 5 | 6.0 | 0.205 | 0.527 | ns |
ELISA_competition | B2_10-mer_I | B2_PostF_I | 5 | 4 | 0.0 | 0.020 | 0.176 | ns |
ELISA_competition | B2_20-mer_I | B2_PostF_I | 5 | 4 | 2.0 | 0.064 | 0.343 | ns |
ELISA_competition | B1_1-mer_II | B1_20-mer_II | 4 | 5 | 3.0 | 0.111 | 0.428 | ns |
ELISA_competition | B1_1-mer_II | B1_10-mer_II | 4 | 5 | 8.0 | 0.730 | 0.939 | ns |
ELISA_competition | B1_1-mer_II | B1_PostF_II | 4 | 4 | 12.0 | 0.343 | 0.772 | ns |
ELISA_competition | B1_10-mer_II | B1_20-mer_II | 5 | 5 | 6.0 | 0.222 | 0.545 | ns |
ELISA_competition | B1_10-mer_II | B1_PostF_II | 5 | 4 | 16.0 | 0.190 | 0.513 | ns |
ELISA_competition | B1_20-mer_II | B1_PostF_II | 5 | 4 | 19.0 | 0.032 | 0.190 | ns |
ELISA_competition | B2_1-mer_II | B2_20-mer_II | 4 | 5 | 11.0 | 0.905 | 0.958 | ns |
ELISA_competition | B2_1-mer_II | B2_10-mer_II | 4 | 5 | 11.0 | 0.905 | 0.958 | ns |
ELISA_competition | B2_1-mer_II | B2_PostF_II | 4 | 4 | 6.0 | 0.686 | 0.939 | ns |
ELISA_competition | B2_10-mer_II | B2_20-mer_II | 5 | 5 | 9.0 | 0.548 | 0.883 | ns |
ELISA_competition | B2_10-mer_II | B2_PostF_II | 5 | 4 | 6.0 | 0.413 | 0.812 | ns |
ELISA_competition | B2_20-mer_II | B2_PostF_II | 5 | 4 | 9.0 | 0.905 | 0.958 | ns |
ELISA_competition | B1_1-mer_III | B1_20-mer_III | 4 | 5 | 7.0 | 0.556 | 0.883 | ns |
ELISA_competition | B1_1-mer_III | B1_10-mer_III | 4 | 5 | 8.0 | 0.730 | 0.939 | ns |
ELISA_competition | B1_1-mer_III | B1_PostF_III | 4 | 4 | 13.5 | 0.124 | 0.446 | ns |
ELISA_competition | B1_10-mer_III | B1_20-mer_III | 5 | 5 | 13.0 | 1.000 | 1.000 | ns |
ELISA_competition | B1_10-mer_III | B1_PostF_III | 5 | 4 | 20.0 | 0.018 | 0.176 | ns |
ELISA_competition | B1_20-mer_III | B1_PostF_III | 5 | 4 | 20.0 | 0.018 | 0.176 | ns |
ELISA_competition | B2_1-mer_III | B2_20-mer_III | 4 | 5 | 4.0 | 0.190 | 0.513 | ns |
ELISA_competition | B2_1-mer_III | B2_10-mer_III | 4 | 5 | 11.0 | 0.905 | 0.958 | ns |
ELISA_competition | B2_1-mer_III | B2_PostF_III | 4 | 4 | 11.0 | 0.486 | 0.883 | ns |
ELISA_competition | B2_10-mer_III | B2_20-mer_III | 5 | 5 | 8.0 | 0.421 | 0.812 | ns |
ELISA_competition | B2_10-mer_III | B2_PostF_III | 5 | 4 | 11.0 | 0.905 | 0.958 | ns |
ELISA_competition | B2_20-mer_III | B2_PostF_III | 5 | 4 | 16.0 | 0.190 | 0.513 | ns |
ELISA_competition | B1_1-mer_IV | B1_20-mer_IV | 4 | 5 | 4.0 | 0.190 | 0.513 | ns |
ELISA_competition | B1_1-mer_IV | B1_10-mer_IV | 4 | 5 | 6.0 | 0.413 | 0.812 | ns |
ELISA_competition | B1_1-mer_IV | B1_PostF_IV | 4 | 4 | 8.0 | 1.000 | 1.000 | ns |
ELISA_competition | B1_10-mer_IV | B1_20-mer_IV | 5 | 5 | 11.0 | 0.841 | 0.958 | ns |
ELISA_competition | B1_10-mer_IV | B1_PostF_IV | 5 | 4 | 13.0 | 0.556 | 0.883 | ns |
ELISA_competition | B1_20-mer_IV | B1_PostF_IV | 5 | 4 | 13.0 | 0.556 | 0.883 | ns |
ELISA_competition | B2_1-mer_IV | B2_20-mer_IV | 4 | 5 | 8.0 | 0.730 | 0.939 | ns |
ELISA_competition | B2_1-mer_IV | B2_10-mer_IV | 4 | 5 | 11.0 | 0.905 | 0.958 | ns |
ELISA_competition | B2_1-mer_IV | B2_PostF_IV | 4 | 4 | 7.0 | 0.886 | 0.958 | ns |
ELISA_competition | B2_10-mer_IV | B2_20-mer_IV | 5 | 5 | 4.0 | 0.095 | 0.428 | ns |
ELISA_competition | B2_10-mer_IV | B2_PostF_IV | 5 | 4 | 5.0 | 0.286 | 0.671 | ns |
ELISA_competition | B2_20-mer_IV | B2_PostF_IV | 5 | 4 | 12.0 | 0.730 | 0.939 | ns |
ELISA_competition | B1_1-mer_Ø | B1_20-mer_Ø | 4 | 5 | 11.0 | 0.898 | 0.958 | ns |
ELISA_competition | B1_1-mer_Ø | B1_10-mer_Ø | 4 | 5 | 12.0 | 0.701 | 0.939 | ns |
ELISA_competition | B1_1-mer_Ø | B1_PostF_Ø | 4 | 4 | 12.0 | 0.186 | 0.513 | ns |
ELISA_competition | B1_10-mer_Ø | B1_20-mer_Ø | 5 | 5 | 10.0 | 0.666 | 0.939 | ns |
ELISA_competition | B1_10-mer_Ø | B1_PostF_Ø | 5 | 4 | 16.0 | 0.109 | 0.428 | ns |
ELISA_competition | B1_20-mer_Ø | B1_PostF_Ø | 5 | 4 | 16.0 | 0.109 | 0.428 | ns |
ELISA_competition | B2_1-mer_Ø | B2_20-mer_Ø | 4 | 5 | 13.0 | 0.556 | 0.883 | ns |
ELISA_competition | B2_1-mer_Ø | B2_10-mer_Ø | 4 | 5 | 12.0 | 0.730 | 0.939 | ns |
ELISA_competition | B2_1-mer_Ø | B2_PostF_Ø | 4 | 4 | 16.0 | 0.029 | 0.190 | ns |
ELISA_competition | B2_10-mer_Ø | B2_20-mer_Ø | 5 | 5 | 13.0 | 1.000 | 1.000 | ns |
ELISA_competition | B2_10-mer_Ø | B2_PostF_Ø | 5 | 4 | 20.0 | 0.020 | 0.176 | ns |
ELISA_competition | B2_20-mer_Ø | B2_PostF_Ø | 5 | 4 | 20.0 | 0.020 | 0.176 | ns |
data_comp_longer %>%
filter(conformation == "PreF") %>%
ggplot(aes(x = timepoint_group,
y = ELISA_competition,
color = group,
fill = group)) +
geom_point(size = 2, shape = 21) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = .5) +
scale_y_log10() +
stat_summary(fun.y = mean,
geom = "crossbar",
color = "black") +
geom_hline(yintercept = 10, linetype = "dashed") +
labs(title = "PreF Antigenic Sites", y = "50% Competition Titer", x = "") +
theme(axis.ticks = element_line(size = .5),
legend.background = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 7)) +
facet_wrap(~epitopes, nrow = 1)
stat.test <- data_comp_longer %>%
filter(conformation == "PostF") %>%
wilcox_test(formula = ELISA_competition ~ timepoint_group_epitope,
paired = FALSE,
p.adjust.method = "fdr",
comparisons = my_comparisons_postF) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
.y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ELISA_competition | B1_1-mer_I | B1_20-mer_I | 4 | 5 | 0.0 | 0.020 | 0.088 | ns | 6533.097 | B1_1-mer…. | 1 | 7 |
ELISA_competition | B1_1-mer_I | B1_10-mer_I | 4 | 5 | 0.0 | 0.020 | 0.088 | ns | 7250.111 | B1_1-mer…. | 1 | 4 |
ELISA_competition | B1_1-mer_I | B1_PostF_I | 4 | 4 | 1.0 | 0.059 | 0.127 | ns | 7967.125 | B1_1-mer…. | 1 | 10 |
ELISA_competition | B1_10-mer_I | B1_20-mer_I | 5 | 5 | 3.0 | 0.056 | 0.127 | ns | 8684.139 | B1_10-me…. | 4 | 7 |
ELISA_competition | B1_10-mer_I | B1_PostF_I | 5 | 4 | 9.0 | 0.905 | 0.905 | ns | 9401.153 | B1_10-me…. | 4 | 10 |
ELISA_competition | B1_20-mer_I | B1_PostF_I | 5 | 4 | 12.0 | 0.730 | 0.773 | ns | 10118.168 | B1_20-me…. | 7 | 10 |
ELISA_competition | B2_1-mer_I | B2_20-mer_I | 4 | 5 | 0.0 | 0.016 | 0.088 | ns | 10835.182 | B2_1-mer…. | 13 | 19 |
ELISA_competition | B2_1-mer_I | B2_10-mer_I | 4 | 5 | 5.0 | 0.286 | 0.381 | ns | 11552.196 | B2_1-mer…. | 13 | 16 |
ELISA_competition | B2_1-mer_I | B2_PostF_I | 4 | 4 | 0.0 | 0.029 | 0.088 | ns | 12269.210 | B2_1-mer…. | 13 | 22 |
ELISA_competition | B2_10-mer_I | B2_20-mer_I | 5 | 5 | 7.0 | 0.310 | 0.385 | ns | 12986.224 | B2_10-me…. | 16 | 19 |
ELISA_competition | B2_10-mer_I | B2_PostF_I | 5 | 4 | 0.0 | 0.016 | 0.088 | ns | 13703.238 | B2_10-me…. | 16 | 22 |
ELISA_competition | B2_20-mer_I | B2_PostF_I | 5 | 4 | 3.0 | 0.111 | 0.195 | ns | 14420.252 | B2_20-me…. | 19 | 22 |
ELISA_competition | B1_1-mer_II | B1_20-mer_II | 4 | 5 | 3.5 | 0.125 | 0.205 | ns | 15137.266 | B1_1-mer…. | 2 | 8 |
ELISA_competition | B1_1-mer_II | B1_10-mer_II | 4 | 5 | 8.5 | 0.771 | 0.793 | ns | 15854.280 | B1_1-mer…. | 2 | 5 |
ELISA_competition | B1_1-mer_II | B1_PostF_II | 4 | 4 | 0.0 | 0.026 | 0.088 | ns | 16571.294 | B1_1-mer…. | 2 | 11 |
ELISA_competition | B1_10-mer_II | B1_20-mer_II | 5 | 5 | 6.5 | 0.236 | 0.327 | ns | 17288.309 | B1_10-me…. | 5 | 8 |
ELISA_competition | B1_10-mer_II | B1_PostF_II | 5 | 4 | 1.0 | 0.034 | 0.088 | ns | 18005.323 | B1_10-me…. | 5 | 11 |
ELISA_competition | B1_20-mer_II | B1_PostF_II | 5 | 4 | 2.0 | 0.064 | 0.127 | ns | 18722.337 | B1_20-me…. | 8 | 11 |
ELISA_competition | B2_1-mer_II | B2_20-mer_II | 4 | 5 | 1.0 | 0.032 | 0.088 | ns | 19439.351 | B2_1-mer…. | 14 | 20 |
ELISA_competition | B2_1-mer_II | B2_10-mer_II | 4 | 5 | 4.0 | 0.190 | 0.297 | ns | 20156.365 | B2_1-mer…. | 14 | 17 |
ELISA_competition | B2_1-mer_II | B2_PostF_II | 4 | 4 | 0.0 | 0.029 | 0.088 | ns | 20873.379 | B2_1-mer…. | 14 | 23 |
ELISA_competition | B2_10-mer_II | B2_20-mer_II | 5 | 5 | 8.0 | 0.421 | 0.474 | ns | 21590.393 | B2_10-me…. | 17 | 20 |
ELISA_competition | B2_10-mer_II | B2_PostF_II | 5 | 4 | 0.0 | 0.016 | 0.088 | ns | 22307.407 | B2_10-me…. | 17 | 23 |
ELISA_competition | B2_20-mer_II | B2_PostF_II | 5 | 4 | 2.0 | 0.064 | 0.127 | ns | 23024.421 | B2_20-me…. | 20 | 23 |
ELISA_competition | B1_1-mer_IV | B1_20-mer_IV | 4 | 5 | 1.0 | 0.032 | 0.088 | ns | 23741.435 | B1_1-mer…. | 3 | 9 |
ELISA_competition | B1_1-mer_IV | B1_10-mer_IV | 4 | 5 | 4.5 | 0.219 | 0.320 | ns | 24458.449 | B1_1-mer…. | 3 | 6 |
ELISA_competition | B1_1-mer_IV | B1_PostF_IV | 4 | 4 | 2.0 | 0.114 | 0.195 | ns | 25175.464 | B1_1-mer…. | 3 | 12 |
ELISA_competition | B1_10-mer_IV | B1_20-mer_IV | 5 | 5 | 6.0 | 0.222 | 0.320 | ns | 25892.478 | B1_10-me…. | 6 | 9 |
ELISA_competition | B1_10-mer_IV | B1_PostF_IV | 5 | 4 | 6.0 | 0.413 | 0.474 | ns | 26609.492 | B1_10-me…. | 6 | 12 |
ELISA_competition | B1_20-mer_IV | B1_PostF_IV | 5 | 4 | 13.0 | 0.556 | 0.607 | ns | 27326.506 | B1_20-me…. | 9 | 12 |
ELISA_competition | B2_1-mer_IV | B2_20-mer_IV | 4 | 5 | 0.0 | 0.016 | 0.088 | ns | 28043.520 | B2_1-mer…. | 15 | 21 |
ELISA_competition | B2_1-mer_IV | B2_10-mer_IV | 4 | 5 | 6.0 | 0.413 | 0.474 | ns | 28760.534 | B2_1-mer…. | 15 | 18 |
ELISA_competition | B2_1-mer_IV | B2_PostF_IV | 4 | 4 | 0.0 | 0.029 | 0.088 | ns | 29477.548 | B2_1-mer…. | 15 | 24 |
ELISA_competition | B2_10-mer_IV | B2_20-mer_IV | 5 | 5 | 7.0 | 0.310 | 0.385 | ns | 30194.562 | B2_10-me…. | 18 | 21 |
ELISA_competition | B2_10-mer_IV | B2_PostF_IV | 5 | 4 | 0.0 | 0.016 | 0.088 | ns | 30911.576 | B2_10-me…. | 18 | 24 |
ELISA_competition | B2_20-mer_IV | B2_PostF_IV | 5 | 4 | 3.0 | 0.111 | 0.195 | ns | 31628.590 | B2_20-me…. | 21 | 24 |
data_comp_longer %>%
filter(conformation == "PostF") %>%
ggplot(aes(x = timepoint_group,
y = ELISA_competition,
color = group,
fill = group)) +
geom_point(size = 2, shape = 21) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = .5) +
scale_y_log10() +
stat_summary(fun.y = mean,
geom = "crossbar",
color = "black") +
geom_hline(yintercept = 10, linetype = "dashed") +
labs(title = "PostF Antigenic Sites", y = "50% Competition Titer", x = "") +
theme(axis.ticks = element_line(size = .5),
legend.background = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 7)) +
facet_wrap(~epitopes, nrow = 1)
Save a table containing the read information from the high-throughput sequencing. It includes the number of raw reads per animal and the number of processed reads with high-quality assignment of HV and HJ genes, obtained using the IgDiscover software as an IgBlast wrapper.
rep_seq_stat <- lapply(rep_seq_ls, function(x) fromJSON(txt = x ))
rep_seq_stat <- lapply(rep_seq_stat, as.data.frame)
rep_seq_stat <- data.table::rbindlist(rep_seq_stat, idcol = "ID", fill = TRUE)
rep_seq_stat <- rep_seq_stat %>%
select(c(ID,read_preprocessing.raw_reads, assignment_filtering.total, assignment_filtering.has_vj_assignment, 17:21))
write.table(rep_seq_stat, "../data/processed_data/summary_sequencing_table.tsv", row.names = FALSE, sep = "\t", quote = FALSE )
After merging the antigen-specific sequences and the bulk sequencing, we run the clonotype module from IgDiscover to assign a clonotype grouping to each sequence. We have done that for both the individualized germline database and the KIMDB database. Here we are reading those datasets and wrangling the data into the required format.
ls <- list.files("../data/clonotypes", recursive = T, full.names = T)
ls <- ls[grepl("rsv", ls) & grepl("individualized|nestor-rm", ls) ]
names(ls) <- basename(dirname(ls))
rds_merge <- lapply(ls, readRDS)
rds_merge <- rbindlist(rds_merge, idcol = TRUE, fill = TRUE)
rds_merge <- rds_merge %>%
select(.id, specificity_group, sc_clone_grp, grp, new_name, ID_timepoint, V_SHM, V_errors, CDR3_aa, cdr3_aa, V_gene, J_gene, v_call, j_call) %>%
mutate(
Timepoint = factor(gsub(".*_", "", ID_timepoint), levels = c("PV", "B1", "B2", "Single-cell")),
ID = gsub("_.*", "", ID_timepoint),
database = plyr::mapvalues(.id,
from = c("nestor-rm", "individualized"),
to = c("KIMDB", "Individualized")
),
database = factor(database, levels = c("KIMDB", "Individualized")),
Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
),
cdr3_aa_length = ifelse(is.na(CDR3_aa), nchar(cdr3_aa), nchar(CDR3_aa)),
v_call = ifelse(is.na(v_call), V_gene, v_call),
j_call = ifelse(is.na(j_call), J_gene, j_call)
) %>%
group_by(.id, ID_timepoint) %>%
distinct(new_name, .keep_all = TRUE) %>%
ungroup() %>%
filter(database %in% c("KIMDB", "Individualized"))
Plotting the sequencing depth for each timepoint per animal. This shows that the sequencing depth for Boost 1 was lower; for that reason, all the analyses were normalized by sequencing depth to take this into account.
# samples from the biological replicates after boost 2 (BR2-B2) from E17 were merged with the technical replicates after boost 2 (TR1-B2)
# this was done due to low sequencing depth for that animal and large expansion of uncharacterized clones
full_rep_indiv <- read.table("../data/processed_data/summary_sequencing_table.tsv",
header = TRUE) %>%
separate(ID, sep = "_", into = c("sample", "ID")) %>%
filter(sample != "TR2-B2") %>%
mutate(sample = ifelse(sample == "BR2-B2", "TR1-B2", sample),
timepoint = case_when(sample == "igm" ~ "PV",
grepl("B1", sample) ~ "B1",
grepl("B2", sample) ~ "B2"),
timepoint = factor(timepoint, levels = c("PV", "B1", "B2"))) %>%
group_by(timepoint, ID) %>%
summarise(across(where(is.numeric), sum)) %>%
ungroup()
# since here we wanted to compare all groups, we used Dunn's test
stat.test <- full_rep_indiv %>%
dunn_test(formula = assignment_filtering.has_cdr3 ~ timepoint,
p.adjust.method = "fdr") %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
.y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
---|---|---|---|---|---|---|---|---|---|---|---|---|
assignment_filtering.has_cdr3 | PV | B1 | 9 | 9 | -3.6525703 | 0.0002596 | 0.0007789 | *** | 1393265 | PV, B1 | 1 | 2 |
assignment_filtering.has_cdr3 | PV | B2 | 9 | 9 | -0.2672612 | 0.7892680 | 0.7892680 | ns | 1491771 | PV, B2 | 1 | 3 |
assignment_filtering.has_cdr3 | B1 | B2 | 9 | 9 | 3.3853091 | 0.0007110 | 0.0010665 | ** | 1590276 | B1, B2 | 2 | 3 |
full_rep_indiv %>%
ggdotplot(y = "assignment_filtering.has_cdr3", x = "timepoint",
group = "timepoint",
fill = "ID",
size = 1,) +
geom_boxplot(alpha = .2, outlier.shape = NA) +
ggpubr::stat_compare_means(method = "kruskal") +
labs(x = "Timepoint", y = "# good quality aligned sequences") +
scale_fill_viridis_d() +
stat_pvalue_manual(stat.test %>% mutate(p.adj = round(p.adj, 4)), label = "p.adj") +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = 1, axis_text_angle = 45, base_rect_size = 1.5)
write.csv(full_rep_indiv, paste0("../", result.dir, "repertoire_depth.csv"), row.names = F)
ggsave(paste0("../", result.dir, "repertoire_depth.pdf"), width = 5, height = 3)
rds_summary <- rds_merge %>%
group_by(database, ID_timepoint, grp) %>%
summarise(
ID = first(ID), Timepoint = first(Timepoint), Group = first(Group),
clonal_size = n(),
database,
sc_clone_grp = first(sc_clone_grp),
V_errors = mean(V_errors),
specificity_group = first(specificity_group),
cdr3_aa_length = mean(cdr3_aa_length),
v_call = first(v_call),
j_call = first(j_call)
) %>%
ungroup() %>%
group_by(database, ID_timepoint) %>%
mutate(clonal_size_rank = dense_rank(dplyr::desc(clonal_size))) %>%
ungroup() %>%
distinct()
rds_summary_noPV <- rds_summary %>%
filter(Timepoint != "PV", Timepoint != "Single-cell")
rds_summary_save <- rds_summary %>%
filter(Timepoint %in% c("Single-cell", "B1","B2"),
database == "Individualized") %>%
group_by(ID_timepoint) %>%
summarise(mean_clonal_size = mean(clonal_size),
median_clonal_size = median(clonal_size),
geom_clonal_size = exp(mean(log(clonal_size))),
unique_clones = sum(clonal_size == 1),
total_clones_detected = n(),
percentage_unique_clones = round(unique_clones/total_clones_detected*100,digits = 2))
rds_summary_save %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
ID_timepoint | mean_clonal_size | median_clonal_size | geom_clonal_size | unique_clones | total_clones_detected | percentage_unique_clones |
---|---|---|---|---|---|---|
E11_B1 | 73.765766 | 24.0 | 23.618580 | 7 | 111 | 6.31 |
E11_B2 | 65.842324 | 18.0 | 16.139896 | 22 | 241 | 9.13 |
E11_Single-cell | 1.596491 | 1.0 | 1.310902 | 169 | 228 | 74.12 |
E12_B1 | 96.802632 | 8.5 | 11.635387 | 5 | 76 | 6.58 |
E12_B2 | 62.696296 | 9.0 | 9.733452 | 18 | 135 | 13.33 |
E12_Single-cell | 1.556000 | 1.0 | 1.304827 | 185 | 250 | 74.00 |
E14_B1 | 79.458333 | 9.0 | 9.683393 | 11 | 96 | 11.46 |
E14_B2 | 111.530612 | 10.5 | 12.335661 | 29 | 196 | 14.80 |
E14_Single-cell | 2.041096 | 1.0 | 1.527244 | 139 | 219 | 63.47 |
E16_B1 | 73.535519 | 14.0 | 15.223779 | 14 | 183 | 7.65 |
E16_B2 | 118.491228 | 31.5 | 29.123260 | 17 | 228 | 7.46 |
E16_Single-cell | 1.996094 | 1.0 | 1.489967 | 169 | 256 | 66.02 |
E17_B1 | 38.460526 | 8.0 | 9.753009 | 8 | 76 | 10.53 |
E17_B2 | 55.614035 | 6.0 | 7.902730 | 31 | 171 | 18.13 |
E17_Single-cell | 1.665158 | 1.0 | 1.381309 | 151 | 221 | 68.33 |
E18_B1 | 41.323529 | 11.0 | 11.772174 | 9 | 102 | 8.82 |
E18_B2 | 72.662921 | 21.0 | 15.643111 | 23 | 178 | 12.92 |
E18_Single-cell | 1.641026 | 1.0 | 1.368241 | 135 | 195 | 69.23 |
E21_B1 | 35.201835 | 5.0 | 6.887328 | 21 | 109 | 19.27 |
E21_B2 | 45.202312 | 14.0 | 13.455144 | 16 | 173 | 9.25 |
E21_Single-cell | 1.674208 | 1.0 | 1.385378 | 151 | 221 | 68.33 |
E23_B1 | 59.675497 | 9.0 | 11.417886 | 19 | 151 | 12.58 |
E23_B2 | 63.044025 | 13.0 | 14.180265 | 18 | 159 | 11.32 |
E23_Single-cell | 1.916279 | 1.0 | 1.484490 | 136 | 215 | 63.26 |
E24_B1 | 55.469231 | 10.0 | 11.115858 | 18 | 130 | 13.85 |
E24_B2 | 62.310735 | 9.0 | 10.429075 | 29 | 177 | 16.38 |
E24_Single-cell | 1.829897 | 1.0 | 1.400170 | 137 | 194 | 70.62 |
write.csv(rds_summary_save,
paste0("../", result.dir, "clone_size_mean_median.csv"),row.names = FALSE)
rds_indiv <- rds_summary %>%
filter(
database == "Individualized",
Timepoint != "Single-cell",
Timepoint != "PV"
) %>%
mutate(
LOR = ifelse(grepl(pattern = paste0(lor_mabs$well_ID, collapse = "|"), x = sc_clone_grp), "cloned", "not_cloned"),
LOR = factor(LOR, levels = c("cloned", "not_cloned"), ordered = TRUE)
)
rds_indiv$Timepoint <- droplevels(rds_indiv$Timepoint)
all <- rds_indiv %>%
tidyr::expand(ID, Timepoint, grp) %>%
filter(Timepoint != "Single-cell")
rds_indiv <- rds_indiv %>%
right_join(all) %>%
mutate(
ID_timepoint = paste(ID, Timepoint, sep = "_"),
database = "individualized",
clonal_size = ifelse(is.na(clonal_size), 0, clonal_size)
) %>%
arrange(grp, ID_timepoint) %>%
tidyr::fill(LOR, Group, sc_clone_grp)
The code below generates cumulative area plots for the clone sizes of the 20-mer and 1-mer groups, further divided by specificity group (DP & PreF). The code starts by summarizing the clone sizes for each combination of “Group”, “Timepoint”, and “specificity_group” from the “rds_indiv” data; the “PostF” specificity group is filtered out for this plot. The data is then plotted with the ggplot2 library as area plots, with the x-axis representing “Timepoint”, the y-axis representing the cumulative “Clonal size”, and the area filled based on “specificity_group” (DP & PreF). Each “Group” is presented as a facet, showing the cumulative growth of the clone sizes for the specified specificities over the different timepoints. The area plots offer a visual representation of how the clone sizes evolve over time, highlighting differences between the two specificity groups (DP & PreF) for both the 20-mer and 1-mer groups.
rds_indiv %>%
group_by(Group, Timepoint, specificity_group) %>%
summarise(clonal_size = sum(clonal_size)) %>%
tidyr::drop_na() %>%
filter(specificity_group != "PostF") %>%
ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
geom_area(color = "black") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 80000)) +
scale_x_discrete(expand = c(0.02, 0.02)) +
scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
labs(y = "Clonal size") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~Group)
ggsave(paste0("../", result.dir, "rsv_specific_clonal_size_area_plot_spec.pdf"), width = 8, height = 2)
rds_indiv %>%
group_by(Group, ID, Timepoint, specificity_group) %>%
summarise(clonal_size = sum(clonal_size)) %>%
tidyr::drop_na() %>%
filter(specificity_group != "PostF") %>%
ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
geom_area(color = "black") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 30000)) +
scale_x_discrete(expand = c(0.02, 0.02)) +
scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
labs(y = "Clonal size") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~Group + ID, ncol = 5)
The code below plots the clonal sizes from the two databases, “KIMDB” and “Individualized”, for the different timepoints. It creates scatter plots with the clonal sizes from the “Individualized” database on the x-axis and from the “KIMDB” database on the y-axis; within each database the clonal sizes are sorted in decreasing order and paired by rank, allowing visual comparison and assessment of correlation. The dashed line with a slope of 1 and an intercept of 0 represents perfect agreement. The plots are faceted by timepoint, providing a concise visual understanding of the clonal size relationship between the two databases over time.
df_all <- data.frame()
for (timepoints in unique(rds_summary_noPV$Timepoint)) {
for (databases in unique(rds_summary_noPV$database)) {
rds_timepoint <- rds_summary_noPV %>%
filter(Timepoint == timepoints, database == databases) %>%
arrange(desc(clonal_size)) %>%
pull(clonal_size)
rds_indiv_corr <- rds_summary_noPV %>%
filter(
database == "Individualized",
Timepoint == timepoints
) %>%
arrange(desc(clonal_size)) %>%
pull(clonal_size)
length(rds_indiv_corr) <- length(rds_timepoint)
df <- data.frame(
size_indiv = rds_indiv_corr,
clonal_size = rds_timepoint,
Timepoint = timepoints,
database = databases
)
df[is.na(df)] <- 0
df_all <- rbind(df, df_all)
}
}
df_all %>%
filter(database != "Individualized") %>%
mutate(database = factor(database, levels = c("KIMDB", "Individualized"))) %>%
ggplot(aes(x = size_indiv, y = clonal_size)) +
geom_point(size = 1) +
geom_abline(intercept = 0, slope = 1, color = "black", linetype = "dashed") +
scale_x_continuous(
trans = pseudo_log_trans(base = 10),
breaks = c(1, 10, 100, 1000, 10000),
labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
) +
scale_y_continuous(
trans = pseudo_log_trans(base = 10),
breaks = c(1, 10, 100, 1000, 10000),
labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
) +
labs(y = "Clonal size\n (KIMDB)", x = "Clonal size\n(Individualized database)") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~ Timepoint)
ggsave(paste0("../", result.dir, "individualized_db_comparison.pdf"), width = 12, height = 6)
The R code below analyzes IGHV-IGHJ pairing for the two databases, “Individualized” and “KIMDB”. The code preprocesses data from the “rds_summary” dataframe, extracting the V and J gene calls and calculating clonal-size percentages for each V-J combination. It then creates a heatmap plot (plot_vj) with IGHV alleles on the x-axis and IGHJ alleles on the y-axis, using the fill color to represent the mean clonal-size percentage for each V-J pairing. The heatmap is faceted by specificity group (“Total”, “PreF”, and “DP”) and vaccination group (“20-mer” and “1-mer”) for each database, providing a compact visual comparison of the IGHV-IGHJ pairing distributions between the two databases.
for(i in c("Individualized", "KIMDB")){
rds_summary_noPV <- rds_summary %>%
# check if we want to filter clonotypes by size or not
filter(database == i) %>%
rbind(., within(., specificity_group <- "Total")) %>%
mutate(v_call = sub("\\*.*","", v_call),
j_call = sub("\\*.*|-.*","", j_call),
clonal_size_log = log10(clonal_size),
Group = factor(Group, levels = c("20-mer", "1-mer")),
specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP", "PostF"))) %>%
group_by(ID) %>%
mutate(clonal_size_perc = (clonal_size/sum(clonal_size)) * 100) %>%
ungroup() %>%
group_by(Group, specificity_group, v_call, j_call) %>%
summarise(clonal_size_mean_perc = mean(clonal_size_perc), clonal_size_mean = mean(clonal_size)) %>%
filter(specificity_group != "PostF") %>% droplevels()
plot_vj <- rds_summary_noPV %>%
ggplot(aes(x= v_call, y = j_call, fill = clonal_size_mean_perc)) +
geom_tile(color = "black") +
scale_fill_viridis_c(option = "viridis", direction = 1, limits = c(0, 1.5), breaks = c(0,0.5, 1, 1.5)) +
scale_y_discrete(position = "right") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
coord_equal() +
theme(axis.text.x = element_text(angle = 90, size = 5, hjust = 1, vjust = 0.5, face = "bold", colour = "black"),
axis.text.y = element_text(face = "bold", colour = "black", size = 5),
strip.text = element_text(face = "bold", size = 7),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top",
axis.ticks = element_line(size = .5),
legend.title = element_text()) +
labs(fill = "Sequence count\n(% RSV repertoire\n per group)", x = "IGHV alleles", y = "IGHJ alleles", title = paste0(i, " Database")) +
facet_wrap(~ specificity_group + Group, ncol = 1, strip.position = "left")
print(plot_vj)
ggsave(plot = plot_vj, filename = paste0("../",result.dir,i,"_IGHV-IGHJ_pairing-rsv-specific-sequences_perc.pdf"), width = 9, height = 7)
}
The provided R code counts and compares unique V-J pairing occurrences in the “Individualized” database. It preprocesses the data, calculates the number of unique V-J pairs for each group, and performs statistical tests. The results are visualized using a dot plot with mean values and p-value labels, enabling easy comparison of unique V-J pairing counts among different groups.
# groups to perform statistical comparisons
my_comparisons <- list(c("Total_B1_20-mer", "Total_B1_1-mer"),
c("Total_B2_20-mer", "Total_B2_1-mer"),
c("PreF_B1_20-mer", "PreF_B1_1-mer"),
c("PreF_B2_20-mer", "PreF_B2_1-mer"),
c("DP_B1_20-mer", "DP_B1_1-mer"),
c("DP_B2_20-mer", "DP_B2_1-mer"))
rds_summary_noPV <- rds_summary %>%
# If want to remove clonal sizer < 1, do it here.
filter(database == "Individualized", Timepoint != "Single-cell", Timepoint != "PV") %>%
rbind(., within(., specificity_group <- "Total")) %>%
filter(specificity_group != "PostF") %>%
mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
group_by(ID_timepoint, specificity_group, Group_specificity, Group) %>%
mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
distinct(v_j_calls) %>%
summarise(unique_v_j = n()) %>%
ungroup()
rds_summary_noPV %>% write.csv(paste0("../", result.dir,"unique_HV_HJ_pairing-data.csv"), row.names = F)
stat.test <- rds_summary_noPV %>%
wilcox_test(formula = unique_v_j ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
.y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
---|---|---|---|---|---|---|---|---|---|---|---|---|
unique_v_j | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 14 | 0.413 | 0.496 | ns | 137.040 | Total_B1…. | 1 | 2 |
unique_v_j | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 11 | 0.901 | 0.901 | ns | 150.288 | Total_B2…. | 3 | 4 |
unique_v_j | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 4 | 0.176 | 0.264 | ns | 163.536 | PreF_B1_…. | 5 | 6 |
unique_v_j | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 0 | 0.016 | 0.048 | * | 176.784 | PreF_B2_…. | 7 | 8 |
unique_v_j | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 20 | 0.016 | 0.048 | * | 190.032 | DP_B1_20…. | 9 | 10 |
unique_v_j | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 18 | 0.064 | 0.127 | ns | 203.280 | DP_B2_20…. | 11 | 12 |
rds_summary_noPV %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "unique_v_j",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 1) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 200)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "HV and HJ unique pairs\n(Count)", x = "") +
stat_summary(fun.y = mean,
geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 150) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "v-j-pair_count.pdf"), width = 5, height = 3)
The R code below generates a Venn diagram to visualize the sharing of HV-HJ combinations between the “20-mer” and “1-mer” groups in the “Individualized” database. The code preprocesses the data from the “rds_summary” dataframe, filtering out the “PostF” specificity and organizing the data by vaccination group. It then creates two lists, one for the “20-mer” and another for the “1-mer” group, containing the distinct HV-HJ combinations for each group. The Venn diagram is plotted using ggVennDiagram, representing the shared and unique HV-HJ combinations between the two groups. The diagram provides a concise visual summary of the overlap and differences in HV-HJ pairing between the “20-mer” and “1-mer” groups in the “Individualized” database.
rds_summary_noPV <- rds_summary %>%
# decide if clonal size greater than 1 should be included
filter(database == "Individualized", Timepoint != "Single-cell") %>%
rbind(., within(., specificity_group <- "Total")) %>%
filter(specificity_group != "PostF") %>%
mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
group_by(Group) %>%
mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
distinct(v_j_calls)
list_v_j <- list("20-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "20-mer"],
"1-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "1-mer"])
ggVennDiagram::ggVennDiagram(list_v_j, label_size = 7,
label_alpha = 0,
set_color = "black",
set_size = 9) +
scale_fill_gradient(low = "#F4FAFE", high = "#4981BF") +
scale_color_manual(values = c("#5F90B0", "#D896C1")) +
theme(legend.position = "none")
ggsave(paste0("../", result.dir,"shared-unique-v-j_pairing.pdf"), height = 5, width = 5)
The lines that save intermediate files are commented out, so not all of them will be saved; one can uncomment them if all the intermediate files are needed. The intermediate files were used to run Recon (v2.5) (Kaplinsky & Arnaout, Nat Commun, 2016) with default parameters; more about running Recon is described in the next section, Recon estimates for RSV-specific diversity.
# add column for loop of ID, timepoint and specificity
clonotype_rsv <- clonotype_rsv %>%
mutate(ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group))
# create empty lists for loop
filtered_animal_rsv <- list()
clonotype_summary_rsv <- list()
# loop to create summary and full clonotype files for saving and/or following analysis
for (animals in unique(clonotype_rsv$ID_timepoint_spec)) {
# save full clonotype files
filtered_animal_rsv[[animals]] <- clonotype_rsv %>%
filter(ID_timepoint_spec == animals) %>%
as.data.frame()
# write.csv(filtered_animal_rsv[[animals]], paste0("../", result.dir, animals, "_clonotype_full.csv"), row.names = FALSE)
# save summary files
clonotype_summary_rsv[[animals]] <- filtered_animal_rsv[[animals]] %>%
group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
arrange(desc(clonal_size)) %>%
ungroup()
#write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}
# doing the same thing but for the total, i.e., not accounting for specificities (PreF, DP, PostF)
for (animals in unique(clonotype_rsv$ID_timepoint)) {
# save summary files
clonotype_summary_rsv[[paste0(animals, "_total")]] <- clonotype_rsv %>%
filter(ID_timepoint == animals) %>%
as.data.frame() %>%
mutate(
specificity_group = "Total",
ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group)
) %>%
group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
arrange(desc(clonal_size)) %>%
ungroup()
# write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}
# Save clonotype summary per animal, taking into account ID, timepoint and clonal group
filtered_animal_rsv_summary <- list()
for (animals in unique(clonotype_rsv$ID)) {
# save summary files, storing each animal's summary in the list created above
filtered_animal_rsv_summary[[animals]] <- clonotype_rsv %>%
filter(ID == animals) %>%
as.data.frame() %>%
group_by(grp, timepoint, ID) %>%
summarise(clonal_size = n(), single_cells = first(sc_clone_grp), v_call = first(v_call), j_call = first(j_call), cdr3_aa = first(cdr3_aa)) %>%
arrange(desc(clonal_size)) %>%
ungroup()
# write.csv(filtered_animal_rsv_summary[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}
to_recon_rsv <- data.table::rbindlist(clonotype_summary_rsv) %>%
select(clonal_size, ID_timepoint_spec) %>%
group_by(clonal_size, ID_timepoint_spec) %>%
summarise(size = n()) %>%
ungroup()
for (i in unique(to_recon_rsv$ID_timepoint_spec)) {
to_recon_table <- to_recon_rsv %>%
filter(ID_timepoint_spec == i) %>%
select(clonal_size, size)
# fwrite(to_recon_table, file = paste0("../", result.dir, i, "_rsv_file_to_recon.txt"), append = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
}
Species richness (Chao1) was calculated using the vegan R package. The samples were first subsampled 100 times for each animal/timepoint, and then the species richness was estimated. The mean of those 100 replicates was used for plotting.
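As a minimal sketch of what these vegan calls return (toy clone sizes, not study data), rrarefy() subsamples an abundance vector down to a chosen depth, and estimateR() then reports the observed richness, the Chao1 and ACE estimates, and their standard errors; these are the five values labelled "Obs", "Chao1", "Chao1_se", "ACE" and "ACE_se" later in this section:
toy_clone_sizes <- c(50, 20, 10, 5, 2, 1, 1, 1) # abundances of 8 hypothetical clonotypes
subsampled <- vegan::rrarefy(toy_clone_sizes, 40) # randomly subsample to a depth of 40 sequences
vegan::estimateR(subsampled) # returns S.obs, S.chao1, se.chao1, S.ACE, se.ACE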
Here is the processing and estimation of species richness:
rsv_chao <- lapply(clonotype_summary_rsv, function(x) select(x, "clonal_size"))
# subsample and replicate the subsampling 100 times for higher accuracy
chaox100 <- function(x, value_to_subsample) {
replicate(100, {
subsample <- vegan::rrarefy(x, value_to_subsample)
chao <- vegan::estimateR(subsample)
return(chao)
})
}
diff_spec_timepoints <- unique(substring(names(clonotype_summary_rsv), 5))
# subsample based on lowest total clonotype size per group
chao_list_results <- list()
for (spec_timepoint in diff_spec_timepoints) {
print(spec_timepoint)
rsv_filtered <- rsv_chao[grepl(spec_timepoint, names(rsv_chao))]
min_to_subsample <- min(unlist(lapply(rsv_filtered, colSums)))
chao_list_results[[spec_timepoint]] <- lapply(rsv_filtered, chaox100, min_to_subsample)
}
## [1] "Single-cell_DP"
## [1] "B1_DP"
## [1] "B2_DP"
## [1] "Single-cell_PreF"
## [1] "B1_PreF"
## [1] "B2_PreF"
## [1] "Single-cell_PostF"
## [1] "B1_PostF"
## [1] "B2_PostF"
## [1] "Single-cell_total"
## [1] "B1_total"
## [1] "B2_total"
change_names <- function(x) {
names(x) <- gsub("_.*", "", names(x))
x
}
# adjusted dataset for plotting
{
chao_results_df <- purrr::map(chao_list_results, ~ change_names(.x))
chao_results_df <- rbindlist(chao_results_df, use.names = TRUE, idcol = TRUE, fill = TRUE)
chao_results_df$algorithm <- rep(c("Obs", "Chao1", "Chao1_se", "ACE", "ACE_se"), nrow(chao_results_df) / 5)
# save intermediate file
chao_results_df %>%
filter(algorithm %in% c("Chao1", "ACE")) %>%
dplyr::rename(Timepoint_specificity = .id) %>%
write.csv(paste0("../", result.dir, "rsv_repertoire_diversity.csv"), row.names = FALSE)
# diversity mean of x100 replicates per animal
chao_results_df %>%
filter(algorithm %in% c("Chao1", "ACE")) %>%
dplyr::rename(Timepoint_specificity = .id) %>%
group_by(algorithm, Timepoint_specificity) %>%
summarise_all(.funs = mean) %>%
write.csv(paste0("../", result.dir, "rsv_repertoire_diversity_mean.csv"), row.names = FALSE)
chao_results_df <- tidyr::pivot_longer(chao_results_df, cols = 2:(length(chao_results_df) - 1), names_to = c("ID")) %>%
tidyr::separate(.id, c("Timepoint", "Specificity"), sep = "_") %>%
mutate(Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
))
}
Here is the plot and comparison between the vaccinated groups:
chao_results_df$Specificity[chao_results_df$Specificity == "total"] <- "Total"
chao_results_df <- chao_results_df %>%
filter(algorithm != "Chao1_se" & algorithm != "ACE_se") %>%
filter(algorithm == "Chao1", Timepoint != "PV", Timepoint != "Single-cell") %>%
mutate(Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
),
Group_specificity = paste(Specificity, Timepoint, Group, sep = "_"),
Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
group_by(ID, Group, Timepoint, Specificity, algorithm, Group_specificity) %>%
summarise(value = mean(value)) %>%
tidyr::drop_na() %>%
ungroup()
write.csv(chao_results_df, paste0("../", result.dir, "chao1_results_plot_values.csv"), row.names = FALSE)
stat.test <- chao_results_df %>%
wilcox_test(formula = value ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
.y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
---|---|---|---|---|---|---|---|---|---|---|---|---|
value | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 16 | 0.190 | 0.285 | ns | 272.3352 | Total_B1…. | 1 | 2 |
value | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 15 | 0.286 | 0.343 | ns | 301.8278 | Total_B2…. | 3 | 4 |
value | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 7 | 0.556 | 0.556 | ns | 331.3205 | PreF_B1_…. | 5 | 6 |
value | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 2 | 0.064 | 0.127 | ns | 360.8131 | PreF_B2_…. | 7 | 8 |
value | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 20 | 0.016 | 0.048 | * | 390.3058 | DP_B1_20…. | 9 | 10 |
value | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 20 | 0.016 | 0.048 | * | 419.7984 | DP_B2_20…. | 11 | 12 |
chao_results_df %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "value",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 1) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "Species richness\n(Chao1)", x = "") +
stat_summary(fun.y = mean,
geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "chao1_species_richness.pdf"), width = 5, height = 3)
According to the Recon default settings and tutorial (check the Recon manual), the count files generated previously were used to create the fitfiles.txt that were used as input for generating the diversity_table.txt for all the samples.
To generate the fitfile for each count file, the following bash script was used in a for loop:
#!/bin/sh
set -euo pipefail
FILE=$1
python recon_v2.5.py -R -t 30 -c -o ${FILE}_fitfile.txt $FILE
python recon_v2.5.py -x --x_max 30 -o ${FILE}_plotfile.txt -b error_bar_parameters.txt ${FILE}_fitfile.txt
Then, each fit file was used for generating the diversity_table.txt with the command:
python recon_v2.5.py -v -D -b error_bar_parameters.txt -o output_diversity_table.txt *rsv_file_to_recon.txt_fitfile.txt
The results from diversity_table.txt for all the samples were used as input for plotting.
recon_res <- read.table("../data/diversity_index/recon/rsv_output_diversity_table.txt", header = TRUE) %>%
filter(Timepoint != "PV", Timepoint != "Single-cell", Specificity != "PostF") %>%
mutate(Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
),
Group_specificity = paste(Specificity, Timepoint, Group, sep = "_")) %>%
mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
ungroup()
write.csv(recon_res, paste0("../", result.dir, "recon_results_plot_values.csv"), row.names = FALSE)
stat.test <- recon_res %>%
wilcox_test(formula = est_0.0D ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
.y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
---|---|---|---|---|---|---|---|---|---|---|---|---|
est_0.0D | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 16 | 0.190 | 0.228 | ns | 275.7082 | Total_B1…. | 1 | 2 |
est_0.0D | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 16 | 0.190 | 0.228 | ns | 304.8120 | Total_B2…. | 3 | 4 |
est_0.0D | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 7 | 0.556 | 0.556 | ns | 333.9159 | PreF_B1_…. | 5 | 6 |
est_0.0D | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 2 | 0.064 | 0.127 | ns | 363.0197 | PreF_B2_…. | 7 | 8 |
est_0.0D | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 20 | 0.016 | 0.048 | * | 392.1236 | DP_B1_20…. | 9 | 10 |
est_0.0D | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 20 | 0.016 | 0.048 | * | 421.2274 | DP_B2_20…. | 11 | 12 |
recon_res %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "est_0.0D",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 1) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "Species richness\n(Recon)", x = "") +
stat_summary(fun.y = mean,
geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "recon_species_richness.pdf"), width = 5, height = 3)
Here the SHM is shown for all the sequences from every animal combined. The data is divided by 20-mer, 1-mer, and specificities.
rds_indiv_total <- rds_merge %>%
mutate(specificity_group = "Total")
rds_indiv_plot <- rbind(rds_merge, rds_indiv_total)
rds_indiv_plot <- rds_indiv_plot %>%
filter(specificity_group != "PostF") %>%
mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_")) %>%
mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>% ungroup() %>% drop_na(Group_specificity)
stat.test <- rds_indiv_plot %>%
wilcox_test(formula = V_errors ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
.y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
---|---|---|---|---|---|---|---|---|---|---|---|---|
V_errors | Total_B1_20-mer | Total_B1_1-mer | 71438 | 41890 | 1096663994 | 0 | 0 | **** | 140.760 | Total_B1…. | 1 | 2 |
V_errors | Total_B2_20-mer | Total_B2_1-mer | 141383 | 92712 | 4905379554 | 0 | 0 | **** | 151.272 | Total_B2…. | 3 | 4 |
V_errors | PreF_B1_20-mer | PreF_B1_1-mer | 20774 | 21126 | 196698706 | 0 | 0 | **** | 161.784 | PreF_B1_…. | 5 | 6 |
V_errors | PreF_B2_20-mer | PreF_B2_1-mer | 48561 | 47362 | 964010280 | 0 | 0 | **** | 172.296 | PreF_B2_…. | 7 | 8 |
V_errors | DP_B1_20-mer | DP_B1_1-mer | 38256 | 8444 | 139047744 | 0 | 0 | **** | 182.808 | DP_B1_20…. | 9 | 10 |
V_errors | DP_B2_20-mer | DP_B2_1-mer | 82239 | 33204 | 1055796386 | 0 | 0 | **** | 193.320 | DP_B2_20…. | 11 | 12 |
counts <- rds_indiv_plot %>% group_by(Group, specificity_group) %>% summarise(counts = n())
rds_indiv_plot %>%
ggpubr::ggviolin(x = "Group_specificity", y = "V_errors", fill = "Group_specificity", group = "Group_specificity",
legend = "none") +
geom_boxplot(outlier.shape = NA, width = 0.15, color = "black", alpha = 0.2)+
geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 80)) +
scale_shape_manual(values=NA)+
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
labs(y = "# IGHV nucleotide mutations", x= "") +
# stat_pvalue_manual(stat.test, label = "p.adj", y.position = 70) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "rsv_specific_SHM_per_group.pdf"), width = 6, height = 3)
Here the SHM data is summarized per macaque, so each dot represents the average SHM of all the antigen-specific sequences for one animal.
rds_indiv_plot_summ <- rds_indiv_plot %>%
filter(Group_specificity != "PostF") %>%
group_by(ID, Group_specificity) %>%
summarise(avg_V_errors = mean(V_errors, na.rm = TRUE),
sd_V_errors = sd(V_errors, na.rm = TRUE),
n = n()) %>%
ungroup() %>%
tidyr::drop_na()
stat.test <- rds_indiv_plot_summ %>%
wilcox_test(formula = avg_V_errors ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
.y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
---|---|---|---|---|---|---|---|---|---|---|---|---|
avg_V_errors | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 5 | 0.286 | 0.826 | ns | 25.37176 | Total_B1…. | 1 | 2 |
avg_V_errors | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 6 | 0.413 | 0.826 | ns | 26.81147 | Total_B2…. | 3 | 4 |
avg_V_errors | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 8 | 0.730 | 0.876 | ns | 28.25118 | PreF_B1_…. | 5 | 6 |
avg_V_errors | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 8 | 0.730 | 0.876 | ns | 29.69090 | PreF_B2_…. | 7 | 8 |
avg_V_errors | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 4 | 0.190 | 0.826 | ns | 31.13061 | DP_B1_20…. | 9 | 10 |
avg_V_errors | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 10 | 1.000 | 1.000 | ns | 32.57032 | DP_B2_20…. | 11 | 12 |
counts <- rds_indiv_plot_summ %>% group_by(Group_specificity) %>% summarise(max_n = max(n), min_n = min(n), total_n = sum(n))
rds_indiv_plot_summ %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "avg_V_errors", fill = "Group_specificity", group = "Group_specificity",
legend = "none") +
geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 25)) +
scale_shape_manual(values=NA)+
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
stat_summary(fun.y = mean,
geom = "crossbar") +
labs(y = "# IGHV nucleotide mutations", x= "") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 22) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
write.csv(rds_indiv_plot_summ, paste0("../", result.dir, "rsv_specific_mean_SHM_per_animal.csv"), row.names = FALSE)
ggsave(paste0("../", result.dir, "rsv_specific_mean_SHM_per_animal.pdf"), width = 6, height = 3)
rds_indiv_hcdr3 <- rds_indiv %>%
# filter(database == "Individualized", Timepoint == c("B1", "B2")) %>%
rbind(., within(., specificity_group <- "Total")) %>%
filter(specificity_group != "PostF") %>%
mutate(
Group_specificity = paste(Group, specificity_group, sep = "_"),
specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP")),
Timepoint = factor(Timepoint, levels = c("B1", "B2"))
) %>% drop_na()
counts <- rds_indiv_hcdr3 %>% group_by(specificity_group, Timepoint, Group) %>% summarise(N=paste0('n = ',n()))
counts
## # A tibble: 12 × 4
## # Groups: specificity_group, Timepoint [6]
## specificity_group Timepoint Group N
## <fct> <fct> <chr> <chr>
## 1 Total B1 1-mer n = 383
## 2 Total B1 20-mer n = 651
## 3 Total B2 1-mer n = 682
## 4 Total B2 20-mer n = 976
## 5 PreF B1 1-mer n = 205
## 6 PreF B1 20-mer n = 205
## 7 PreF B2 1-mer n = 389
## 8 PreF B2 20-mer n = 346
## 9 DP B1 1-mer n = 144
## 10 DP B1 20-mer n = 413
## 11 DP B2 1-mer n = 250
## 12 DP B2 20-mer n = 583
rds_indiv_hcdr3 %>%
ggplot(aes(x = cdr3_aa_length, fill = Group)) +
geom_density(alpha = .7) +
scale_y_continuous(expand = c(0, 0)) +
scale_x_continuous(expand = c(0, 0)) +
scale_fill_manual(values = c("#5F90B0", "#D896C1")) +
labs(y = "Density", x = "HCDR3 aa length") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
# geom_text(data = counts, aes(label = N, fill = NA), x = Inf, y = Inf, hjust = 1, vjust = 1, size = 4, color = "black") +
facet_wrap(~ specificity_group + Timepoint, ncol = 3, dir = "v")
ggsave(paste0("../", result.dir, "rsv_specific_CDR3_length.pdf"), width = 6, height = 3)
kimdb <- Biostrings::readDNAStringSet("../data/databases/nestor_rm/V.fasta")
joined_dbs <- c(kimdb, individualized_dbs_sel)
uniq_joined_dbs <- unique(joined_dbs)
uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]
## DNAStringSet object of length 15:
## width seq names
## [1] 296 CAGGTCCAGCTGGTGCAATCCGG...CCGTGTATTACTGTGCAAGAGA E12.IGHV1-NL_1*01...
## [2] 293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E12.IGHV3-100*01
## [3] 293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
## [4] 293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
## [5] 298 GAAGTGCAGCTGGTGGAGTCTGG...TTGTATTACTGTAGTAGAGAGA E14.IGHV3-NL_15*0...
## ... ... ...
## [11] 296 GAGGTGCAGCTGGCGGAGTCTGG...CCGTGTATTACTGTGCGAGAGA E16.IGHV3-87*02
## [12] 299 CAGGTACAGCTGAAGGAGTCAGG...CCGTGTATTACTGTGCGAGACA E16.IGHV4-NL_23*0...
## [13] 296 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGAGA E18.IGHV4-NL_5*01...
## [14] 299 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGACA E18.IGHV4-NL_33*0...
## [15] 297 GAAGTGCAGCTGGTGGAGTCTGG...CTTGTATTACTGTGCAAAGATA E21.IGHV3-141*01
size_uniq_kimdb <- data.frame(
v_unique = table(substr(names(uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]), 1, 3)),
type = "Not validated"
) %>%
dplyr::rename(
v_count = v_unique.Freq,
ID = v_unique.Var1
)
size_indv_db$type <- "KIMDB"
kimdb_v_indiv <- rbind(size_indv_db, size_uniq_kimdb) %>%
add_row(ID = c("E11", "E17", "E23", "E24"), v_count = rep(0), type = rep("Not validated")) %>%
group_by(ID) %>%
arrange(ID, .by_group = TRUE) %>%
mutate(
diff = v_count - lag(v_count, default = last(v_count)),
diff = ifelse(diff < 0, v_count, diff)
)
kimdb_v_indiv %>%
write.csv(paste0("../",result.dir, "alleles_validated_KIMDB.csv"), row.names = FALSE)
kimdb_v_indiv %>%
mutate(type = factor(type, levels = c("Not validated", "KIMDB"))) %>%
ggplot(aes(x = ID, y = diff, fill = type)) +
geom_col(color = "black") +
scale_fill_viridis_d(direction = -1) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
scale_x_discrete(expand = c(0, 0)) +
labs(y = "V alelle counts", x = "Animal ID") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5))
ggsave(paste0("../", result.dir, "indiv_validated_KIMDB_alleles.pdf"), width = 8, height = 6.38)
clonal_tree_data <- clonotype_rsv %>%
ungroup() %>%
distinct(new_name, .keep_all = TRUE) %>%
group_by(specificity_group, sc_clone_grp, grp, ID_timepoint) %>%
add_tally(name = "clonal_size") %>%
ungroup()
mabs_of_interests <- c("LOR21" = "E16_05_A08", "LOR24" = "E16_05_H03", "LOR74" = "E11_05_B06")
# read data from heavy_chain
V_genes <- readDNAStringSet("../data/databases/individualized/combined/V.fasta")
J_genes <- readDNAStringSet("../data/databases/individualized/combined/J.fasta")
HC_all_filtered <- clonotype_rsv %>% filter(timepoint == "Single-cell")
# read data from light chain
LV_genes <- readDNAStringSet("../data/databases/cirelli_LC/V.fasta")
LJ_genes <- readDNAStringSet("../data/databases/cirelli_LC/J.fasta")
Select all the sequences that belong to the same clonotypes as the mAbs of interest, in this case LOR21, LOR24, and LOR74. The selected sequences are saved as FASTA files for subsequent multiple sequence alignment and phylogenetic tree inference.
sc_seq_count <- list()
for(i in seq_along(mabs_of_interests)){
print(i)
mab <- mabs_of_interests[i]
if(any(stringr::str_count(names(mab), "LOR") <= 1)){
mab <- mabs_of_interests[i]
mab_name <- names(mab)
}
if(any(stringr::str_count(names(mab), "LOR") > 1)){
mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
mab_name <- names(mab)}
# create UCA heavy chain
UCA_V <- V_genes[HC_all_filtered$V_gene[HC_all_filtered$name %in% mab]]
UCA_J <- J_genes[HC_all_filtered$J_gene[HC_all_filtered$name %in% mab]]
UCA <- DNAStringSet(paste0(UCA_V,UCA_J))
names(UCA) <- paste0(mab_name,"_UCA")
# select sequences part of the same clonotype
group <- clonal_tree_data %>% filter(stringr::str_detect(sc_clone_grp, mab)) %>% select(grp) %>% unique() %>% pull()
filtered <- clonal_tree_data %>%
filter(grp == group)
# save csv with B cell lineage info
if(any(stringr::str_count(names(mab), "LOR") > 1)){
write.csv(filtered, paste0("../",result.dir,mab_name[1],".csv"),
row.names = FALSE)
}else{
write.csv(filtered, paste0("../",result.dir,mab_name,".csv"),
row.names = FALSE)
}
# deduplicating sequences
fasta <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
fasta_lc <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
#adding single cell sequences
sc_seqs <- filtered[!duplicated(filtered[,c('sc_clone_grp','VDJ_nt')]),]
#saving duplicated
if(any(stringr::str_count(names(mab), "LOR") > 1)){
sc_seq_count[[mab_name[1]]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}else{
sc_seq_count[[mab_name]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}
sc_seqs <- df_to_fasta(sequence_strings = sc_seqs$VDJ_nt, sequence_name = gsub(":", "_",sc_seqs$new_name), save_fasta = FALSE)
fasta <- c(fasta, sc_seqs, UCA)
fasta <- fasta[unique(names(fasta))]
if(any(stringr::str_count(names(mab), "LOR") > 1)){
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name[[1]], ".fasta"), append = FALSE, format = "fasta")
}else{
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, ".fasta"), append = FALSE, format = "fasta") }
}
## [1] 1
## [1] 2
## [1] 3
for(i in seq_along(mabs_of_interests)){
print(i)
mab <- mabs_of_interests[i]
if(any(stringr::str_count(names(mab), "LOR") <= 1)){
mab <- mabs_of_interests[i]
mab_name <- names(mab)
}
if(any(stringr::str_count(names(mab), "LOR") > 1)){
mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
mab_name <- names(mab)}
# create UCA light chain
UCA_LV <- LV_genes[clono_light_chains$v_call[clono_light_chains$name %in% mab]]
UCA_LJ <- LJ_genes[clono_light_chains$j_call[clono_light_chains$name %in% mab]]
UCA_LC <- DNAStringSet(paste0(UCA_LV,UCA_LJ))
names(UCA_LC) <- paste0(mab_name,"_LC_UCA")
# select light chain clonotypes
group <- clono_light_chains %>% filter(stringr::str_detect(name, mab)) %>% select(grp) %>% unique() %>% pull()
filtered <- clono_light_chains %>%
filter(grp == group, substr(name,1,3) == substr(mab,1,3))
# save csv with B cell lineage info (light chain)
if(any(stringr::str_count(names(mab), "LOR") > 1)){
write.csv(filtered, paste0("../",result.dir, mab_name[1],"_LC", ".csv"),
row.names = FALSE)
}else{
write.csv(filtered, paste0("../",result.dir,mab_name,"_LC",".csv"),
row.names = FALSE)
}
# save object as BStringset
fasta <- df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE)
# save duplicated values
if(any(stringr::str_count(names(mab), "LOR") > 1)){
sc_seq_count[[paste0(mab_name[1],"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}else{
sc_seq_count[[paste0(mab_name,"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}
fasta <- c(fasta, UCA_LC)
fasta <- fasta[names(fasta)]
if(any(stringr::str_count(names(mab), "LOR") > 1)){
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name[[1]],"_LC", ".fasta"), append = FALSE, format = "fasta")
}else{
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, "_LC", ".fasta"), append = FALSE, format = "fasta") }
}
## [1] 1
## [1] 2
## [1] 3
Run a multiple sequence alignment using MUSCLE (v 5.1) on all the FASTA files generated throughout this analysis. Then run FastTree (v 2.1.11) on the aligned sequences and save the tree output, which is plotted in the following code.
source ~/.bash_profile
DIR_DATE=$(date +'%Y-%m-%d')
cd ../results/$DIR_DATE/
for f in *.fasta; do muscle -align $f -output $f.aln; done
for f in *.aln; do fasttree -nt -gtr $f > $f.tre; done
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 217 seqs, avg length 364, max 370
## Input: 8 seqs, avg length 322, max 326
## Input: 118 seqs, avg length 370, max 370
## Input: 10 seqs, avg length 331, max 334
## Input: 71 seqs, avg length 370, max 370
## Input: 3 seqs, avg length 332, max 334
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.06 seconds
## Refining topology: 31 rounds ME-NNIs, 2 rounds ME-SPRs, 16 rounds ML-NNIs
## 0.12 seconds: SPR round 1 of 2, 101 of 432 nodes
## 0.25 seconds: SPR round 1 of 2, 401 of 432 nodes
## 0.38 seconds: SPR round 2 of 2, 201 of 432 nodes
## 0.49 seconds: SPR round 2 of 2, 401 of 432 nodes
## Total branch-length 2.197 after 0.52 sec
## 0.64 seconds: ML NNI round 1 of 16, 101 of 215 splits, 27 changes (max delta 1.450)
## ML-NNI round 1: LogLk = -5720.973 NNIs 57 max delta 7.09 Time 0.74
## 0.85 seconds: Optimizing GTR model, step 2 of 12
## 0.95 seconds: Optimizing GTR model, step 3 of 12
## 1.08 seconds: Optimizing GTR model, step 5 of 12
## 1.21 seconds: Optimizing GTR model, step 8 of 12
## 1.32 seconds: Optimizing GTR model, step 10 of 12
## GTR Frequencies: 0.2176 0.2444 0.3116 0.2264
## GTR rates(ac ag at cg ct gt) 1.2448 1.4803 0.7188 0.5852 1.4823 1.0000
## 1.45 seconds: ML Lengths 1 of 215 splits
## 1.55 seconds: Site likelihoods with rate category 11 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.816 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## 1.69 seconds: ML NNI round 2 of 16, 101 of 215 splits, 38 changes (max delta 0.001)
## ML-NNI round 2: LogLk = -5299.576 NNIs 67 max delta 0.00 Time 1.81
## Turning off heuristics for final round of ML NNIs (converged)
## 1.81 seconds: ML NNI round 3 of 16, 1 of 215 splits
## 1.92 seconds: ML NNI round 3 of 16, 101 of 215 splits, 40 changes (max delta 0.000)
## 2.03 seconds: ML NNI round 3 of 16, 201 of 215 splits, 57 changes (max delta 1.019)
## ML-NNI round 3: LogLk = -5298.548 NNIs 61 max delta 1.02 Time 2.06 (final)
## Optimize all lengths: LogLk = -5298.540 Time 2.12
## 2.26 seconds: ML split tests for 100 of 214 internal splits
## 2.41 seconds: ML split tests for 200 of 214 internal splits
## Total time: 2.44 seconds Unique: 217/217 Bad splits: 0/214
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 10 rounds ME-NNIs, 2 rounds ME-SPRs, 5 rounds ML-NNIs
## Total branch-length 0.080 after 0.00 sec
## ML-NNI round 1: LogLk = -612.010 NNIs 1 max delta 0.13 Time 0.00
## GTR Frequencies: 0.2670 0.2510 0.2459 0.2361
## GTR rates(ac ag at cg ct gt) 0.1273 14.1848 2.8987 3.9092 3.0704 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.642 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -587.982 NNIs 0 max delta 0.00 Time 0.02
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -587.981 NNIs 0 max delta 0.00 Time 0.02 (final)
## Optimize all lengths: LogLk = -587.981 Time 0.03
## Total time: 0.04 seconds Unique: 6/8 Bad splits: 0/3
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.04 seconds
## Refining topology: 28 rounds ME-NNIs, 2 rounds ME-SPRs, 14 rounds ML-NNIs
## 0.12 seconds: SPR round 1 of 2, 101 of 234 nodes
## 0.27 seconds: SPR round 2 of 2, 101 of 234 nodes
## Total branch-length 1.596 after 0.35 sec
## 0.38 seconds: ML Lengths 101 of 116 splits
## 0.51 seconds: ML NNI round 1 of 14, 101 of 116 splits, 29 changes (max delta 5.999)
## ML-NNI round 1: LogLk = -4057.963 NNIs 35 max delta 6.00 Time 0.53
## 0.62 seconds: Optimizing GTR model, step 4 of 12
## 0.74 seconds: Optimizing GTR model, step 8 of 12
## 0.85 seconds: Optimizing GTR model, step 11 of 12
## GTR Frequencies: 0.2038 0.2859 0.3002 0.2101
## GTR rates(ac ag at cg ct gt) 1.0761 0.9774 0.5841 0.4792 1.2071 1.0000
## 0.95 seconds: Site likelihoods with rate category 1 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.789 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## 1.13 seconds: ML NNI round 2 of 14, 101 of 116 splits, 15 changes (max delta 0.290)
## ML-NNI round 2: LogLk = -3817.538 NNIs 18 max delta 1.05 Time 1.16
## ML-NNI round 3: LogLk = -3814.285 NNIs 1 max delta 3.25 Time 1.23
## ML-NNI round 4: LogLk = -3814.285 NNIs 0 max delta 0.00 Time 1.28
## Turning off heuristics for final round of ML NNIs (converged)
## 1.28 seconds: ML NNI round 5 of 14, 1 of 116 splits
## 1.42 seconds: ML NNI round 5 of 14, 101 of 116 splits, 0 changes
## ML-NNI round 5: LogLk = -3814.284 NNIs 0 max delta 0.00 Time 1.46 (final)
## Optimize all lengths: LogLk = -3814.284 Time 1.52
## 1.69 seconds: ML split tests for 100 of 115 internal splits
## Total time: 1.72 seconds Unique: 118/118 Bad splits: 0/115
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 13 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
## Total branch-length 0.104 after 0.01 sec
## ML-NNI round 1: LogLk = -699.140 NNIs 1 max delta 0.00 Time 0.01
## GTR Frequencies: 0.2026 0.3061 0.2592 0.2321
## GTR rates(ac ag at cg ct gt) 0.7128 4.0992 1.9319 0.9442 0.6308 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.648 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -675.314 NNIs 1 max delta 0.00 Time 0.04
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -675.314 NNIs 0 max delta 0.00 Time 0.05 (final)
## Optimize all lengths: LogLk = -675.314 Time 0.05
## Total time: 0.06 seconds Unique: 9/10 Bad splits: 0/6
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR74.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.01 seconds
## Refining topology: 25 rounds ME-NNIs, 2 rounds ME-SPRs, 12 rounds ML-NNIs
## 0.11 seconds: SPR round 2 of 2, 101 of 140 nodes
## Total branch-length 0.623 after 0.13 sec
## ML-NNI round 1: LogLk = -2035.235 NNIs 32 max delta 1.30 Time 0.19
## 0.22 seconds: Optimizing GTR model, step 3 of 12
## 0.32 seconds: Optimizing GTR model, step 11 of 12
## GTR Frequencies: 0.2003 0.2812 0.3012 0.2172
## GTR rates(ac ag at cg ct gt) 1.3353 1.7821 0.7868 0.4971 1.3891 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.772 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -1916.232 NNIs 8 max delta 0.00 Time 0.46
## Turning off heuristics for final round of ML NNIs (converged)
## 0.46 seconds: ML NNI round 3 of 12, 1 of 69 splits
## ML-NNI round 3: LogLk = -1916.232 NNIs 1 max delta 0.00 Time 0.54 (final)
## Optimize all lengths: LogLk = -1916.232 Time 0.56
## Total time: 0.66 seconds Unique: 71/71 Bad splits: 0/68
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR74_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 6 rounds ME-NNIs, 2 rounds ME-SPRs, 3 rounds ML-NNIs
## Total branch-length 0.053 after 0.00 sec
## ML-NNI round 1: LogLk = -560.888 NNIs 0 max delta 0.00 Time 0.00
## GTR Frequencies: 0.1960 0.3050 0.2620 0.2370
## GTR rates(ac ag at cg ct gt) 3.2025 7.6580 2.7938 2.4124 1.8012 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.635 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -548.791 NNIs 0 max delta 0.00 Time 0.00
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -548.791 NNIs 0 max delta 0.00 Time 0.00 (final)
## Optimize all lengths: LogLk = -548.791 Time 0.00
## Total time: 0.01 seconds Unique: 3/3 Bad splits: 0/0
The generated trees are edited using treeio and plotted with ggtree. The trees were rerooted to their respective Unmutated Common Ancestor (UCA), which here is defined simply as the V and J gene germlines combined. The gap between V and J is inserted automatically by the alignment method, so the CDR3 is not considered part of the UCA.
# function to root tree in UCA
.to_root_uca <- function(x){
root(x,which(grepl("UCA",x[["tip.label"]])))
}
ls <- list.files(paste0("../",result.dir), pattern = "*\\.tre", full.names = TRUE)
names(ls) <- lapply(ls, function(x) {
if (stringr::str_count(x, "LOR") > 1) {
substr(x, 24, 34)
}else if (grepl("LC",x)) {
substr(x, 24, 31)
}else if (grepl("LOR",x)) {
substr(x, 24, 28)
}else{
substr(x, 24,33)
}})
trees <- lapply(ls, read.tree)
trees_rerooted <- lapply(trees, .to_root_uca)
plots <- lapply(trees_rerooted, ggtree)
for(i in seq_along(plots)){
print(i)
plot_name <- names(plots)[i]
plots_edit <- plots[[i]]$data %>%
mutate(new_label = ifelse(grepl("sc_|LOR|E11_TR1-B2-igg_M05494_64_000000000-CRY9F_1_2115_14487_10780",label), gsub("sc_", "",label), ""),
new_label = plyr::mapvalues(new_label,from = lor_mabs$well_ID, to = lor_mabs$LOR,
warn_missing = FALSE),
new_label = ifelse(grepl("LOR",new_label), new_label, ""),
name = label,
timepoint = case_when(grepl("B2-igg",name) ~ "B2",
grepl("B1-igg",name) ~ "B1",
grepl("sc_|LOR",name) ~ "single_cell",
grepl("igm",name) ~ "PV",
TRUE ~ "intersects"),
name = gsub("sc_", "", label))
if(stringr::str_count(plot_name, "LOR") > 1){
plots_edit <- left_join(plots_edit, sc_seq_count[[paste0(plot_name,1)]], by = "name")
}else{
plots_edit <- left_join(plots_edit, sc_seq_count[[plot_name]], by = "name")
}
# shapes_timepoints <- c("PV" = 18, "B1" = 17, "B2" = 16, "single_cell" = 4)
colors_timepoints <- c("PV" = "red", "intersects" = "black", "B1" = "#66c2a5", "B2" = "#fc8d62", "single_cell" = "#fc8d62")
gg_plot <- ggtree(plots_edit,aes(color = timepoint)) +
{if(grepl("LC", plot_name))geom_tippoint(shape = 18, size = 1)
else geom_tippoint(aes(size = duplicated), shape = 18)}+
# geom_tippoint(aes(size = duplicated), shape = 23) +
geom_tiplab(aes(label = new_label), hjust = -.2) +
labs(size = "Count", shape = "Shape", color = "Timepoint") +
scale_color_manual(values = colors_timepoints) +
coord_cartesian(clip = 'off') +
# scale_shape_manual(values = shapes_timepoints) +
scale_size_area(limits = c(1,25), breaks = c(1,5,10,15,25), max_size = 3)+
geom_treescale(width = .05)
print(gg_plot)
ggsave(gg_plot, filename = paste0("../", result.dir, names(ls)[[i]], "_tree.pdf"), width = ifelse(grepl("LC", plot_name), 2, 4), height = ifelse(grepl("LC", plot_name), 2, 6))
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
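As a minimal sketch of the rerooting step used above (a toy Newick string with hypothetical tip labels, not data from this study), ape::root() places the root at the tip whose label contains "UCA":
# Toy example (hypothetical tip labels) of rooting a tree at its UCA tip,
# mirroring the .to_root_uca() helper above
library(ape)
toy <- read.tree(text = "((seq1:0.10,seq2:0.20):0.05,(seq3:0.10,LORXX_UCA:0.01):0.05);")
rerooted <- root(toy, which(grepl("UCA", toy$tip.label)))
plot(rerooted) # the UCA tip now branches directly from the root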
Query the LOR24 amino acid sequence in the bulk repertoire sequencing data from each animal. The query was done without alignment because of the large number of sequences; instead, string distances were calculated using the Levenshtein distance. The entire dataset was loaded, filtered, and used to calculate distances to LOR24. The output was saved in .rds format to avoid recalculating it on each run.
processed_data <- readRDS("../data/clonotypes/individualized/processed_clonotypes.rds")
processed_data <- processed_data %>% select(grp, name, ID, timepoint, new_name, name, V_SHM, VDJ_aa)
LOR24aa <- "QLQLQESGPGLVKSSETLPLTCAVSGDSISSSYWSWIRQAPGKGLEWIGYIYGSGSYSHYNPSLKSRVTLSVDTSKNQFFLRLNSVTVADTAVYYCARGGRGNTYSWNRFDVWGPGVLVTVSS"
# calculate string distance between LOR24 and all the sequences
identity_matrix <- stringdist::stringdist(gsub("\\*", "",processed_data$VDJ_aa), LOR24aa, method = "lv", nthread = 8)
# calculate the percentage of identity, normalize by LOR24 length
processed_data <- processed_data %>%
mutate(identity = (1-identity_matrix/nchar(LOR24aa)) * 100) %>%
select(V_SHM, identity, ID)
saveRDS(processed_data, "../data/clonotypes/individualized/lor24_identity_calculation.rds")
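To make the identity metric explicit, here is a toy sketch of the same calculation using made-up peptide strings (not sequences from this dataset): the Levenshtein distance to the reference is normalized by the reference length and converted to a percentage.
# Toy illustration of the identity calculation above (hypothetical sequences)
library(stringdist)
ref <- "CARGGRGNTYSWNRFDV" # stand-in reference, not the real LOR24 sequence
query <- "CARGGRGNAYSWNRFDV" # one substitution relative to the reference
lv <- stringdist(query, ref, method = "lv") # Levenshtein distance = 1
(1 - lv / nchar(ref)) * 100 # ~94.1% identity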
The processed dataset above was used as input for a 2D histogram of the somatic hypermutation percentage versus the percent identity to the LOR24 amino acid sequence. In essence, this plot shows how sequences across the different non-human primates evolved toward higher identity to LOR24.
processed_data <- readRDS("../data/clonotypes/individualized/lor24_identity_calculation.rds")
# plot a 2d-histogram of mutations and identity to LOR24
processed_data %>%
filter(identity >= 70) %>%
ggplot(aes(x = V_SHM, y = identity)) +
geom_bin2d(bins = 30) +
scale_fill_viridis_c(trans = "log10") +
theme_prism() +
# scale_y_continuous(expand=c(0,1),limits=c(0,101)) +
labs(x = "% Divergence from germline", y = paste0("% Identity to LOR24aa"),
fill = "Sequence counts\n(log10)") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12)) +
facet_wrap(~ID)
ggsave(filename = paste0("../", result.dir,"LOR24_2d-histogram.pdf"))
LOR mAbs showed diverse competition profiles against the panel of standardized mAbs while binding to the PreF and PostF RSV fusion proteins.
data_comp_auc$EC50_preF <- log10(data_comp_auc$EC50_preF)
data_comp_auc$EC50_postF <- log10(data_comp_auc$EC50_postF)
data_comp_auc$IC50_RSV_A2 <- log10(data_comp_auc$IC50_RSV_A2)
# change here the columns you want to remove from the heatmap
to_remove <- c("EC50_postF","EC50_preF", "epitope", "specificity", "IC50_RSV_A2")
annot_colors <- list(specificity = c("PreF" = "#F7D586",
"PreF/PostF" = "#CD87F8",
"PostF" = "#92CDD6",
"w.b." = "grey20"), epitope = c(
"Ø" = "#F3B084",
"V" = "salmon",
"III" = "#A9D08D",
"IV" = "#FAD0FF",
"II" = "#FFD966",
"I" = "#9BC1E6",
"UK-Pre" = "#C9C9C9",
"UK-DP" = "#8497B0" ,
"WB" = "grey95",
"PostF" = "black",
"foldon" = "grey40"))
{
g_heatmap_scale <- data_comp_auc %>%
select(-all_of(to_remove)) %>%
mutate(across(.cols = everything(), .fns = function(x) pmax(x,0))) %>%
t() %>%
pheatmap::pheatmap(scale = "none", angle_col = 90, cutree_cols = 8,
clustering_method = "ward.D",
color = viridisLite::viridis(100),
cluster_rows = FALSE, border_color = "grey40",
cluster_cols = TRUE,
legend_breaks = c(0, 0.5, 1, 1.5, max(.)),
legend_labels = c("0", "0.5", "1", "1.5", "Normalized\ncompetition\n"),
annotation_col = data_comp_auc[,13:14],
annotation_colors = annot_colors,
cellwidth = 5, cellheight = 5, fontsize_row = 5, fontsize_col = 6, fontsize = 6)
ggsave(g_heatmap_scale,filename = paste0("../", result.dir, "/", "LOR_heatmap_auc_wardD.pdf"), width = 18, height = 12, units = "cm")
}
This MDS is another way to visualize the same data shown in the heatmap above. Here, the MDS input is the Euclidean distance. The data were scaled and centered prior to calculating the Euclidean distance, which is the most commonly used and simplest distance metric.
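Scaling matters here because variables with larger numeric ranges would otherwise dominate the Euclidean distance. A minimal sketch with made-up values (not data from this study):
# Toy illustration (hypothetical values): effect of scaling on Euclidean distance
toy <- data.frame(a = c(0.1, 0.2, 0.9), # small numeric range
                  b = c(100, 900, 120)) # large numeric range
dist(toy) # distances driven almost entirely by column "b"
dist(scale(toy, center = TRUE, scale = TRUE)) # both columns now contribute comparably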
# check which mabs should be included and added
to_habillage <- factor(rownames(data_comp_auc), levels = c(paste0("LOR", sprintf(fmt = "%02d",seq(1:106)))))
data_comp_auc <- data_comp_auc %>% tibble::rownames_to_column("name")
to_remove <- c("epitope", "specificity", "IC50_RSV_A2")
# compute MDS
mds_scaled <- data_comp_auc %>%
select(-all_of(c(to_remove, "name"))) %>%
scale(center = TRUE, scale = TRUE) %>%
dist(method = "euclidean") %>%
cmdscale() %>%
as_tibble()
colnames(mds_scaled) <- c("Dim.1", "Dim.2")
# Plot MDS
mds_scaled$name <- to_habillage
p1 <- mds_scaled %>%
left_join(data_comp_auc) %>%
ggplot(aes(Dim.1,Dim.2, label = name, color = epitope)) +
geom_point(size = 2) +
ggrepel::geom_text_repel(max.overlaps = 5) +
scale_color_manual(values = annot_colors[["epitope"]]) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5),
aspect.ratio = 1)
plotly::ggplotly(p1)
ggsave(plot = p1, width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-sites.pdf"))
p2 <- mds_scaled %>%
left_join(data_comp_auc) %>%
ggplot(aes(Dim.1,Dim.2, label = name, color = IC50_RSV_A2)) +
geom_point(size = 2) +
ggrepel::geom_text_repel(max.overlaps = 5) +
scale_color_viridis_c() +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
labs(color = "IC50 RSV\n(log10)")+
theme(axis.ticks = element_line(size = .5),
aspect.ratio = 1)
plotly::ggplotly(p2)
ggsave(plot = p2, width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-neuts.pdf"))
mabs_lors <- mabs_lors %>%
filter(name %in% lor_mabs$well_ID) %>%
mutate(mAbs_ID = plyr::mapvalues(name, from = lor_mabs$well_ID, to = lor_mabs$LOR)) %>%
arrange(mAbs_ID) %>%
relocate(mAbs_ID)
mabs_clonal_rank <- rds_summary
for(i in mabs_lors$name) {
mabs_clonal_rank[[i]] <- ifelse(grepl(i, rds_summary$sc_clone_grp),
mabs_lors[mabs_lors$name == i,]$mAbs_ID,
NA)
}
col_combine <- colnames(mabs_clonal_rank[15:length(mabs_clonal_rank)])
mabs_clonal_rank <- mabs_clonal_rank %>%
mutate(LOR = purrr::invoke(coalesce, across(all_of(col_combine)))) %>%
select(LOR, colnames(mabs_clonal_rank)[! colnames(mabs_clonal_rank) %in% col_combine]) %>%
filter(!is.na(LOR), database == "Individualized", Timepoint == "B2") %>%
arrange(LOR) %>%
mutate(well_ID = plyr::mapvalues(LOR, from = lor_mabs$LOR, to = substr(lor_mabs$well_ID, 1,3))) %>%
filter(ID == well_ID)
mabs_lors <- mabs_lors %>%
mutate(clonal_rank_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size_rank),
clonal_size_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size),
clonal_group = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$grp)) %>%
relocate(mAbs_ID, clonal_rank_B2, clonal_size_B2, clonal_group)
# Fixed manually LOR24 (same as LOR19), LOR37 (not detected B2, clonal size sc = 1), LOR40 (not detected B2, clonal size sc = 1 ), LOR42 (same clone as LOR01), LOR94 (same clone as LOR01), LOR96 (not detected B2, clonal size sc = 4)
#mabs_lors %>% write.csv(paste0("../data/single_cell/LOR_mAbs_info.csv"), row.names = F)
mabs_lors <- read.csv(file = paste0("../data/single_cell/LOR_mAbs_info.csv"), sep = ",")
mabs_lors %>%
left_join(data_comp_auc %>% tibble::rowid_to_column("mAbs_ID") %>% mutate(mAbs_ID = as.character(name)), by = "mAbs_ID") %>%
ggplot(aes(x = clonal_rank_B2, y = clonal_size_B2, color = epitope)) +
geom_point(size = 3) +
scale_color_manual(values = annot_colors[["epitope"]]) +
scale_y_log10() +
scale_x_log10() +
labs(fill = "Animal ID", x = "Clonal rank\n(log10)", y = "Clonal size\n(log10)") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5))
merged_mabs_lors <- mabs_lors %>%
left_join(clonotype_rsv, by = "name") %>%
left_join(clono_light_chains,suffix = c("_HC","_LC"), by = "name") %>%
mutate(v_call_HC = V_gene.x) %>%
fill(v_call_LC)
write.csv(merged_mabs_lors, file = "../data/single_cell/LOR_mAbs_info_full.csv", row.names = F)
merged_mabs_lors %>%
group_by(v_call_HC, v_call_LC) %>%
mutate(mabs_counts = n()) %>%
ggplot(aes(x= v_call_HC, y = v_call_LC, fill = mabs_counts)) +
geom_tile(color = "black") +
scale_fill_viridis_c(option = "viridis", direction = 1) +
scale_y_discrete(position = "right") +
labs(fill = "mAb counts", x = "IGHV alleles", y = "IGHJ alleles") +
coord_equal() +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.text.x = element_text(angle = 90, size = 6, hjust = 1,
vjust = 0.5, face = "bold", colour = "black"),
axis.text.y = element_text(face = "bold", colour = "black", size = 6),
strip.text = element_text(face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "right",
axis.ticks = element_line(size = .5),
legend.title = element_text())
renv::snapshot()
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS/LAPACK: /Users/rodrigoarcoverde/opt/anaconda3/envs/rsv_np_repertoire/lib/libopenblasp-r0.3.21.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] kableExtra_1.3.4 dplyr_1.1.0 ggprism_1.0.4
## [4] scales_1.2.1 vegan_2.6-4 lattice_0.20-45
## [7] permute_0.9-7 data.table_1.14.8 Biostrings_2.66.0
## [10] GenomeInfoDb_1.34.8 XVector_0.38.0 IRanges_2.32.0
## [13] S4Vectors_0.36.0 BiocGenerics_0.44.0 ggpubr_0.6.0
## [16] ggplot2_3.4.1 rstatix_0.7.2 ggtree_3.6.0
## [19] treeio_1.22.0 tidyr_1.3.0 jsonlite_1.8.4
##
## loaded via a namespace (and not attached):
## [1] ggVennDiagram_1.2.2 colorspace_2.1-0 ggsignif_0.6.4
## [4] ellipsis_0.3.2 class_7.3-21 aplot_0.1.10
## [7] rstudioapi_0.14 proxy_0.4-27 farver_2.1.1
## [10] ggrepel_0.9.3 fansi_1.0.4 xml2_1.3.3
## [13] splines_4.2.2 cachem_1.0.7 knitr_1.42
## [16] broom_1.0.4 cluster_2.1.4 pheatmap_1.0.12
## [19] compiler_4.2.2 httr_1.4.5 backports_1.4.1
## [22] Matrix_1.5-3 fastmap_1.1.1 lazyeval_0.2.2
## [25] cli_3.6.0 htmltools_0.5.4 tools_4.2.2
## [28] gtable_0.3.1 glue_1.6.2 GenomeInfoDbData_1.2.9
## [31] Rcpp_1.0.10 carData_3.0-5 jquerylib_0.1.4
## [34] vctrs_0.5.2 ape_5.7 svglite_2.1.1
## [37] nlme_3.1-162 crosstalk_1.2.0 xfun_0.37
## [40] stringr_1.5.0 rvest_1.0.3 lifecycle_1.0.3
## [43] zlibbioc_1.44.0 MASS_7.3-58.3 parallel_4.2.2
## [46] RColorBrewer_1.1-3 yaml_2.3.7 gridExtra_2.3
## [49] ggfun_0.0.9 yulab.utils_0.0.6 sass_0.4.5
## [52] stringi_1.7.12 highr_0.10 tidytree_0.4.2
## [55] e1071_1.7-13 rlang_1.0.6 pkgconfig_2.0.3
## [58] systemfonts_1.0.4 bitops_1.0-7 evaluate_0.20
## [61] purrr_1.0.1 sf_1.0-7 patchwork_1.1.2
## [64] htmlwidgets_1.6.1 labeling_0.4.2 cowplot_1.1.1
## [67] tidyselect_1.2.0 plyr_1.8.8 magrittr_2.0.3
## [70] R6_2.5.1 generics_0.1.3 DBI_1.1.3
## [73] pillar_1.8.1 withr_2.5.0 mgcv_1.8-42
## [76] units_0.8-1 abind_1.4-5 RCurl_1.98-1.10
## [79] tibble_3.2.0 crayon_1.5.2 car_3.1-1
## [82] KernSmooth_2.23-20 utf8_1.2.3 plotly_4.10.1
## [85] RVenn_1.1.0 rmarkdown_2.20 grid_4.2.2
## [88] digest_0.6.31 classInt_0.4-9 webshot_0.5.4
## [91] gridGraphics_0.5-1 munsell_0.5.0 viridisLite_0.4.1
## [94] ggplotify_0.1.0 bslib_0.4.2