1 Abstract

This Rmarkdown provides the analysis of the processed data in the paper entitled “Multivalent Antigen Display on Nanoparticles Diversifies B Cell Responses”. It includes the B cell receptor analysis, the plots, and the generation of intermediate files for plotting with other programs. The processed files used here are AIRR-compliant datasets that can be downloaded from Zenodo (record 7895251); they are either Human Respiratory Syncytial Virus-specific (RSV-specific) or the full bulk repertoire from the vaccinated non-human primates (Macaca mulatta).

2 Needed libraries

library(jsonlite)
library(tidyr)
library(treeio)
library(ggtree)
library(rstatix)
library(ggpubr)
library(Biostrings)
library(data.table)
library(vegan)
library(ggplot2)
library(scales)
library(ggprism)
library(dplyr)
library(kableExtra)
source("df_to_fasta.R")
set.seed(123)

3 Load data

The data used here is stored in .rds format or as a table (.csv or .tsv) and can be found in the Zenodo repository (record 7895251).

data_comp <- read.csv("../data/ELISA_comp/2023-03-06_normalized_plasma_compt.csv") %>%
  mutate(group = case_when(group == "NP 100%" ~ "20-mer",
                           group == "NP 50%" ~ "10-mer",
                           group == "Sol" ~ "1-mer",
                           group == "PostF" ~ "PostF"))

clonotype_rsv <- readRDS("../data/clonotypes/individualized/rsv-specific_clonotypes.rds")


# add AIRR-compliant column names and filter out the "PV" timepoint, which was not used for the analysis
clonotype_rsv <- clonotype_rsv %>%
  filter(timepoint != "PV") %>%
  mutate(cdr3_aa_length = nchar(CDR3_aa),
         cdr3_aa = CDR3_aa,
         v_call = V_gene,
         j_call = J_gene,
         d_call = D_gene) 

# set color for fill and color aes layers
fill_col_values <- c("20-mer" = "#5F90B0", "10-mer" = "#92CDD5", "1-mer" = "#D896C1", "PostF" = "grey50")
color_values <- c("20-mer" = "#2E6997", "10-mer" = "#469698", "1-mer" = "#BE3C8C", "PostF" = "grey20")
 
# load repertoire sequencing reads info
rep_seq_ls <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = TRUE)
rep_seq_names <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = FALSE)
names(rep_seq_ls) <- sub("_st","",gsub("/","_",substr(rep_seq_names, 1,10)))

# load mAbs data
lor_mabs <- read.csv("../data/specificity/LOR_single-cells_characterized.csv")
mabs_lors <- read.csv("../data/single_cell/sc_summary_filtered.csv")

# load competition ELISA data
# edited the raw values to have the max value as the reference competition for ADI, MPE8, 101F, D25
data_comp_auc <- read.csv("../data/ELISA_comp/2023-04-28_LOR_norm-dAUC.csv", header = T, row.names = 1, encoding="UTF-8") %>%
  replace(is.na(.), 0) %>%
  mutate(specificity = factor(specificity, levels = c("PreF", "PreF/PostF", "PostF", "w.b.")),
         epitope = factor(epitope, levels = c("Ø","V", "IV","III", "II","I", "foldon", "UK-Pre","UK-DP","PostF", "WB")))

# load light chains information clonotyped
clono_light_chains <- read.table("../data/clonotypes/light_chain_assigned_clonotypes.tsv", sep = "\t", header = TRUE) 

4 Plasma profiling

The non-human primate plasma samples from this study were used in antibody competition ELISAs against previously characterized monoclonal antibodies, on plates coated with pre-fusion or post-fusion protein. Here we show this data using different visualization methods.

4.1 Multidimensional scaling (MDS)

Multidimensional scaling aims to preserve the distances between datapoints and/or samples computed from a set of variables. Thus, the closer two points are, the more similar their competition profiles are in this case. It is a good way to summarize a large number of variables in a 2D space. The MDS input is a dissimilarity matrix; for this plot the matrix is based on the cosine distance. Although the Euclidean distance is the most commonly used, we decided to use the cosine distance here because the magnitude of the responses is less important to us than the competition profile itself. Some NHPs were very strong responders, so the Euclidean distance would place them far away from all the other NHPs simply because of their high antibody titers. Since the cosine distance measures the angle between profiles rather than their absolute values, it is insensitive to the magnitude of the antibody titers.

cosine_distance <- function(x) {
  # for cosine similarity matrix
  Matrix <- x %>% 
    select(-c(timepoints, ID, group)) %>%
    as.matrix()
  sim <- Matrix / sqrt(rowSums(Matrix * Matrix))
  sim <- sim %*% t(sim)
  # convert to cosine dissimilarity matrix (distance matrix)
  D_sim <- as.dist(1 - sim)
  
  mds <- D_sim %>%       
    cmdscale(3) %>%
    as_tibble()
  colnames(mds) <- c("Dim.1", "Dim.2", "Dim.3")
  
  mds <- mds %>%
    mutate(group = x$group,
           timepoints = x$timepoints,
           ID = x$ID)
  return(mds)
}
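As a quick check of the scale-invariance argument above, here is a minimal sketch with made-up competition values (not part of the dataset): animal “b” has the same profile as “a” but 10x the magnitude, while “c” has a different profile.

# toy values only, to illustrate why cosine distance was chosen over Euclidean distance
toy <- data.frame(mAb1 = c(10, 100, 10),
                  mAb2 = c(20, 200, 5),
                  row.names = c("a", "b", "c"))
m <- as.matrix(toy)
dist(m)                          # Euclidean distance separates "a" and "b" because of magnitude
sim <- m / sqrt(rowSums(m * m))  # row-normalize, as in cosine_distance()
as.dist(1 - sim %*% t(sim))      # cosine distance between "a" and "b" is ~0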

4.2 Plasma profiling with MDS divided by timepoint

The code below plots 4 different MDS plots. They are divided between 2 weeks after Boost 1 (B1) and Boost 2 (B2), with or without the animals vaccinated with the PostF immunogen. The label at the top of each plot indicates B1 or B2, and the legends state whether PostF animals are included. The cosine distance was calculated separately for each subset using the custom cosine_distance() function provided above; for that reason, the position of points should not be compared directly between plots but only between points within the same plot.

# calculate cosine distance and generate MDS with and without PostF
mds_b1 <- data_comp %>% filter(timepoints == "B1") %>% cosine_distance()
mds_b2 <- data_comp %>% filter(timepoints == "B2") %>% cosine_distance()
mds_no_postf_b1 <- data_comp %>% filter(group != "PostF", timepoints == "B1") %>% cosine_distance()
mds_no_postf_b2 <- data_comp %>% filter(group != "PostF", timepoints == "B2") %>% cosine_distance()
ls_mds <- list(mds_b1, mds_b2, mds_no_postf_b1, mds_no_postf_b2)

plot_list <- list()
for(f in seq_along(ls_mds)){
# Plot and color by groups
    mds_plot <- ls_mds[[f]]
    if(f %in% c(2,4)){
      mds_plot$Dim.1 <- mds_plot$Dim.1 * -1 # flip axis in first dimension
    }
    
    plot <- mds_plot %>%
      ggplot(aes(Dim.1, Dim.2, color = group, fill = group)) + 
      geom_point(size = 4, shape = 21) +
      scale_fill_manual(values = fill_col_values) +
      scale_color_manual(values = color_values) +
      lims(x = c(-.5,.5), y = c(-.4, .4)) + 
      ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
      theme(axis.ticks = element_line(size = .5),
            legend.position = c(.2,.2),
            legend.background = element_blank(),
            legend.box.background = element_rect(colour = "black")) +
      facet_wrap(~timepoints)
    
    plot_list[[f]] <- plot
    ggsave(plot = plot, filename = paste0("../", result.dir, f,"_mds_cosine-distance.pdf"), device = "pdf", width = 4, height = 4)
  }

ggarrange(plot_list[[3]], plot_list[[4]], plot_list[[1]], plot_list[[2]], ncol = 2, nrow = 2, common.legend = TRUE,legend = "top", legend.grob = get_legend(plot_list[[2]] + theme(legend.position = "top")))

4.3 Competition titers as bar plots

The code below generates bar plots to visualize the competition titers for epitopes on the PreF or PostF RSV antigen. The code starts by processing the data and categorizing the epitopes based on the competing mAbs. It then defines the pairwise comparisons for statistical testing between groups within each timepoint, with FDR correction. Afterwards, it performs Wilcoxon rank-sum tests and displays the results in a table. Finally, the code creates a series of bar plots for each epitope showing the competition titers at the different timepoints, with data shown as colored points and the mean indicated by a crossbar. The plots are faceted by epitope, providing an overview of the competition titer dynamics for the different antigenic sites.

The code focused solely on the statistical tests, analysis, and p-value correction can be found in rsv_statistical_analysis.Rmd or in its rendered version online.
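For reference, the FDR correction requested via p.adjust.method = "fdr" in rstatix::wilcox_test() corresponds to the Benjamini-Hochberg adjustment from base R's p.adjust(); a minimal sketch with made-up p-values:

# made-up p-values, only to illustrate the Benjamini-Hochberg ("fdr") adjustment
p_raw <- c(0.005, 0.020, 0.030, 0.400)
p.adjust(p_raw, method = "fdr")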

4.3.1 Competition for epitopes on PreF

data_comp_longer <- data_comp %>%
  pivot_longer(cols = 2:9, names_to = "mAb", values_to = "ELISA_competition") %>%
  mutate(epitopes = plyr::mapvalues(mAb, 
                                    from = c("D25.PreF", "MPE8.PreF", "ADI.PreF", "Pali.PreF", "X101F.PreF",
                                             "ADI.PostF", "X101F.PostF", "Pali.PostF"), 
                                    to = c("Ø", "III", "I", "II", "IV",
                                           "I", "IV", "II")),
         epitopes = factor(epitopes, levels = c("Ø", "III", "IV", "II", "I")),
         conformation = factor(case_when(grepl("PostF", mAb) ~ "PostF",
                                  grepl("PreF", mAb) ~ "PreF"), levels = c("PreF", "PostF")),
        timepoint_group = paste(timepoints, group, sep = "_"),
        timepoint_group_epitope = paste(timepoints, group, epitopes, sep = "_"))

my_comparisons_sites <- list(c("B1_1-mer", "B1_20-mer"),
                       c("B1_1-mer", "B1_10-mer"),
                       c("B1_1-mer", "B1_PostF"),
                       c("B1_10-mer", "B1_20-mer"),
                       c("B1_10-mer", "B1_PostF"),
                       c("B1_20-mer", "B1_PostF"),
                       c("B2_1-mer", "B2_20-mer"),
                       c("B2_1-mer", "B2_10-mer"),
                       c("B2_1-mer", "B2_PostF"),
                       c("B2_10-mer", "B2_20-mer"),
                       c("B2_10-mer", "B2_PostF"),
                       c("B2_20-mer", "B2_PostF"))

my_comparisons_1 <- lapply(my_comparisons_sites, paste0, "_I")
my_comparisons_2 <- lapply(my_comparisons_sites, paste0, "_II")
my_comparisons_3 <- lapply(my_comparisons_sites, paste0, "_III")
my_comparisons_4 <- lapply(my_comparisons_sites, paste0, "_IV")
my_comparisons_5 <- lapply(my_comparisons_sites, paste0, "_Ø")

# Selecting sites for statistical comparison only between groups for each timepoint
# removed statistical comparisons for Boost 1 site I in PreF since most values were below the threshold of detection
my_comparisons_preF <- c(my_comparisons_1[7:12], my_comparisons_2, my_comparisons_3, my_comparisons_4, my_comparisons_5)
my_comparisons_postF <- c(my_comparisons_1, my_comparisons_2, my_comparisons_4)
  

stat.test <- data_comp_longer %>%
  filter(conformation == "PreF") %>%
  wilcox_test(formula = ELISA_competition ~ timepoint_group_epitope, 
              paired = FALSE, 
              p.adjust.method = "fdr", 
              comparisons = my_comparisons_preF)

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
ELISA_competition B2_1-mer_I B2_20-mer_I 4 5 14.0 0.413 0.812 ns
ELISA_competition B2_1-mer_I B2_10-mer_I 4 5 20.0 0.020 0.176 ns
ELISA_competition B2_1-mer_I B2_PostF_I 4 4 0.0 0.029 0.190 ns
ELISA_competition B2_10-mer_I B2_20-mer_I 5 5 6.0 0.205 0.527 ns
ELISA_competition B2_10-mer_I B2_PostF_I 5 4 0.0 0.020 0.176 ns
ELISA_competition B2_20-mer_I B2_PostF_I 5 4 2.0 0.064 0.343 ns
ELISA_competition B1_1-mer_II B1_20-mer_II 4 5 3.0 0.111 0.428 ns
ELISA_competition B1_1-mer_II B1_10-mer_II 4 5 8.0 0.730 0.939 ns
ELISA_competition B1_1-mer_II B1_PostF_II 4 4 12.0 0.343 0.772 ns
ELISA_competition B1_10-mer_II B1_20-mer_II 5 5 6.0 0.222 0.545 ns
ELISA_competition B1_10-mer_II B1_PostF_II 5 4 16.0 0.190 0.513 ns
ELISA_competition B1_20-mer_II B1_PostF_II 5 4 19.0 0.032 0.190 ns
ELISA_competition B2_1-mer_II B2_20-mer_II 4 5 11.0 0.905 0.958 ns
ELISA_competition B2_1-mer_II B2_10-mer_II 4 5 11.0 0.905 0.958 ns
ELISA_competition B2_1-mer_II B2_PostF_II 4 4 6.0 0.686 0.939 ns
ELISA_competition B2_10-mer_II B2_20-mer_II 5 5 9.0 0.548 0.883 ns
ELISA_competition B2_10-mer_II B2_PostF_II 5 4 6.0 0.413 0.812 ns
ELISA_competition B2_20-mer_II B2_PostF_II 5 4 9.0 0.905 0.958 ns
ELISA_competition B1_1-mer_III B1_20-mer_III 4 5 7.0 0.556 0.883 ns
ELISA_competition B1_1-mer_III B1_10-mer_III 4 5 8.0 0.730 0.939 ns
ELISA_competition B1_1-mer_III B1_PostF_III 4 4 13.5 0.124 0.446 ns
ELISA_competition B1_10-mer_III B1_20-mer_III 5 5 13.0 1.000 1.000 ns
ELISA_competition B1_10-mer_III B1_PostF_III 5 4 20.0 0.018 0.176 ns
ELISA_competition B1_20-mer_III B1_PostF_III 5 4 20.0 0.018 0.176 ns
ELISA_competition B2_1-mer_III B2_20-mer_III 4 5 4.0 0.190 0.513 ns
ELISA_competition B2_1-mer_III B2_10-mer_III 4 5 11.0 0.905 0.958 ns
ELISA_competition B2_1-mer_III B2_PostF_III 4 4 11.0 0.486 0.883 ns
ELISA_competition B2_10-mer_III B2_20-mer_III 5 5 8.0 0.421 0.812 ns
ELISA_competition B2_10-mer_III B2_PostF_III 5 4 11.0 0.905 0.958 ns
ELISA_competition B2_20-mer_III B2_PostF_III 5 4 16.0 0.190 0.513 ns
ELISA_competition B1_1-mer_IV B1_20-mer_IV 4 5 4.0 0.190 0.513 ns
ELISA_competition B1_1-mer_IV B1_10-mer_IV 4 5 6.0 0.413 0.812 ns
ELISA_competition B1_1-mer_IV B1_PostF_IV 4 4 8.0 1.000 1.000 ns
ELISA_competition B1_10-mer_IV B1_20-mer_IV 5 5 11.0 0.841 0.958 ns
ELISA_competition B1_10-mer_IV B1_PostF_IV 5 4 13.0 0.556 0.883 ns
ELISA_competition B1_20-mer_IV B1_PostF_IV 5 4 13.0 0.556 0.883 ns
ELISA_competition B2_1-mer_IV B2_20-mer_IV 4 5 8.0 0.730 0.939 ns
ELISA_competition B2_1-mer_IV B2_10-mer_IV 4 5 11.0 0.905 0.958 ns
ELISA_competition B2_1-mer_IV B2_PostF_IV 4 4 7.0 0.886 0.958 ns
ELISA_competition B2_10-mer_IV B2_20-mer_IV 5 5 4.0 0.095 0.428 ns
ELISA_competition B2_10-mer_IV B2_PostF_IV 5 4 5.0 0.286 0.671 ns
ELISA_competition B2_20-mer_IV B2_PostF_IV 5 4 12.0 0.730 0.939 ns
ELISA_competition B1_1-mer_Ø B1_20-mer_Ø 4 5 11.0 0.898 0.958 ns
ELISA_competition B1_1-mer_Ø B1_10-mer_Ø 4 5 12.0 0.701 0.939 ns
ELISA_competition B1_1-mer_Ø B1_PostF_Ø 4 4 12.0 0.186 0.513 ns
ELISA_competition B1_10-mer_Ø B1_20-mer_Ø 5 5 10.0 0.666 0.939 ns
ELISA_competition B1_10-mer_Ø B1_PostF_Ø 5 4 16.0 0.109 0.428 ns
ELISA_competition B1_20-mer_Ø B1_PostF_Ø 5 4 16.0 0.109 0.428 ns
ELISA_competition B2_1-mer_Ø B2_20-mer_Ø 4 5 13.0 0.556 0.883 ns
ELISA_competition B2_1-mer_Ø B2_10-mer_Ø 4 5 12.0 0.730 0.939 ns
ELISA_competition B2_1-mer_Ø B2_PostF_Ø 4 4 16.0 0.029 0.190 ns
ELISA_competition B2_10-mer_Ø B2_20-mer_Ø 5 5 13.0 1.000 1.000 ns
ELISA_competition B2_10-mer_Ø B2_PostF_Ø 5 4 20.0 0.020 0.176 ns
ELISA_competition B2_20-mer_Ø B2_PostF_Ø 5 4 20.0 0.020 0.176 ns
data_comp_longer %>%
  filter(conformation == "PreF") %>%
  ggplot(aes(x = timepoint_group,
            y = ELISA_competition,
            color = group,
            fill = group)) +
  geom_point(size = 2, shape = 21) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = .5) +
  scale_y_log10() +
  stat_summary(fun.y = mean, 
               geom = "crossbar",
               color = "black") +
  geom_hline(yintercept = 10, linetype = "dashed") +
  labs(title = "PreF Antigenic Sites", y = "50% Competition Titer", x = "") +
  theme(axis.ticks = element_line(size = .5),
        legend.background = element_blank(),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 7)) +
  facet_wrap(~epitopes, nrow = 1) 

4.3.2 Competition for epitopes on PostF

stat.test <- data_comp_longer %>%
  filter(conformation == "PostF") %>%
  wilcox_test(formula = ELISA_competition ~ timepoint_group_epitope, 
              paired = FALSE, 
              p.adjust.method = "fdr", 
              comparisons = my_comparisons_postF) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
ELISA_competition B1_1-mer_I B1_20-mer_I 4 5 0.0 0.020 0.088 ns 6533.097 B1_1-mer…. 1 7
ELISA_competition B1_1-mer_I B1_10-mer_I 4 5 0.0 0.020 0.088 ns 7250.111 B1_1-mer…. 1 4
ELISA_competition B1_1-mer_I B1_PostF_I 4 4 1.0 0.059 0.127 ns 7967.125 B1_1-mer…. 1 10
ELISA_competition B1_10-mer_I B1_20-mer_I 5 5 3.0 0.056 0.127 ns 8684.139 B1_10-me…. 4 7
ELISA_competition B1_10-mer_I B1_PostF_I 5 4 9.0 0.905 0.905 ns 9401.153 B1_10-me…. 4 10
ELISA_competition B1_20-mer_I B1_PostF_I 5 4 12.0 0.730 0.773 ns 10118.168 B1_20-me…. 7 10
ELISA_competition B2_1-mer_I B2_20-mer_I 4 5 0.0 0.016 0.088 ns 10835.182 B2_1-mer…. 13 19
ELISA_competition B2_1-mer_I B2_10-mer_I 4 5 5.0 0.286 0.381 ns 11552.196 B2_1-mer…. 13 16
ELISA_competition B2_1-mer_I B2_PostF_I 4 4 0.0 0.029 0.088 ns 12269.210 B2_1-mer…. 13 22
ELISA_competition B2_10-mer_I B2_20-mer_I 5 5 7.0 0.310 0.385 ns 12986.224 B2_10-me…. 16 19
ELISA_competition B2_10-mer_I B2_PostF_I 5 4 0.0 0.016 0.088 ns 13703.238 B2_10-me…. 16 22
ELISA_competition B2_20-mer_I B2_PostF_I 5 4 3.0 0.111 0.195 ns 14420.252 B2_20-me…. 19 22
ELISA_competition B1_1-mer_II B1_20-mer_II 4 5 3.5 0.125 0.205 ns 15137.266 B1_1-mer…. 2 8
ELISA_competition B1_1-mer_II B1_10-mer_II 4 5 8.5 0.771 0.793 ns 15854.280 B1_1-mer…. 2 5
ELISA_competition B1_1-mer_II B1_PostF_II 4 4 0.0 0.026 0.088 ns 16571.294 B1_1-mer…. 2 11
ELISA_competition B1_10-mer_II B1_20-mer_II 5 5 6.5 0.236 0.327 ns 17288.309 B1_10-me…. 5 8
ELISA_competition B1_10-mer_II B1_PostF_II 5 4 1.0 0.034 0.088 ns 18005.323 B1_10-me…. 5 11
ELISA_competition B1_20-mer_II B1_PostF_II 5 4 2.0 0.064 0.127 ns 18722.337 B1_20-me…. 8 11
ELISA_competition B2_1-mer_II B2_20-mer_II 4 5 1.0 0.032 0.088 ns 19439.351 B2_1-mer…. 14 20
ELISA_competition B2_1-mer_II B2_10-mer_II 4 5 4.0 0.190 0.297 ns 20156.365 B2_1-mer…. 14 17
ELISA_competition B2_1-mer_II B2_PostF_II 4 4 0.0 0.029 0.088 ns 20873.379 B2_1-mer…. 14 23
ELISA_competition B2_10-mer_II B2_20-mer_II 5 5 8.0 0.421 0.474 ns 21590.393 B2_10-me…. 17 20
ELISA_competition B2_10-mer_II B2_PostF_II 5 4 0.0 0.016 0.088 ns 22307.407 B2_10-me…. 17 23
ELISA_competition B2_20-mer_II B2_PostF_II 5 4 2.0 0.064 0.127 ns 23024.421 B2_20-me…. 20 23
ELISA_competition B1_1-mer_IV B1_20-mer_IV 4 5 1.0 0.032 0.088 ns 23741.435 B1_1-mer…. 3 9
ELISA_competition B1_1-mer_IV B1_10-mer_IV 4 5 4.5 0.219 0.320 ns 24458.449 B1_1-mer…. 3 6
ELISA_competition B1_1-mer_IV B1_PostF_IV 4 4 2.0 0.114 0.195 ns 25175.464 B1_1-mer…. 3 12
ELISA_competition B1_10-mer_IV B1_20-mer_IV 5 5 6.0 0.222 0.320 ns 25892.478 B1_10-me…. 6 9
ELISA_competition B1_10-mer_IV B1_PostF_IV 5 4 6.0 0.413 0.474 ns 26609.492 B1_10-me…. 6 12
ELISA_competition B1_20-mer_IV B1_PostF_IV 5 4 13.0 0.556 0.607 ns 27326.506 B1_20-me…. 9 12
ELISA_competition B2_1-mer_IV B2_20-mer_IV 4 5 0.0 0.016 0.088 ns 28043.520 B2_1-mer…. 15 21
ELISA_competition B2_1-mer_IV B2_10-mer_IV 4 5 6.0 0.413 0.474 ns 28760.534 B2_1-mer…. 15 18
ELISA_competition B2_1-mer_IV B2_PostF_IV 4 4 0.0 0.029 0.088 ns 29477.548 B2_1-mer…. 15 24
ELISA_competition B2_10-mer_IV B2_20-mer_IV 5 5 7.0 0.310 0.385 ns 30194.562 B2_10-me…. 18 21
ELISA_competition B2_10-mer_IV B2_PostF_IV 5 4 0.0 0.016 0.088 ns 30911.576 B2_10-me…. 18 24
ELISA_competition B2_20-mer_IV B2_PostF_IV 5 4 3.0 0.111 0.195 ns 31628.590 B2_20-me…. 21 24
data_comp_longer %>%
  filter(conformation == "PostF") %>%
  ggplot(aes(x = timepoint_group,
            y = ELISA_competition,
            color = group,
            fill = group)) +
  geom_point(size = 2, shape = 21) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = .5) +
  scale_y_log10() +
  stat_summary(fun.y = mean, 
               geom = "crossbar",
               color = "black") +
  geom_hline(yintercept = 10, linetype = "dashed") +
  labs(title = "PostF Antigenic Sites", y = "50% Competition Titer", x = "") +
  theme(axis.ticks = element_line(size = .5),
        legend.background = element_blank(),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 7)) +
  facet_wrap(~epitopes, nrow = 1) 

5 Save metadata from B cell repertoire sequencing

Save a table containing read information from the high-throughput sequencing. It includes the number of raw reads per animal and the number of processed reads with high-quality assignment of HV and HJ genes using the IgDiscover software (an IgBlast wrapper).

rep_seq_stat <-  lapply(rep_seq_ls, function(x) fromJSON(txt = x ))
rep_seq_stat <-  lapply(rep_seq_stat, as.data.frame)
rep_seq_stat <- data.table::rbindlist(rep_seq_stat, idcol = "ID", fill = TRUE) 
rep_seq_stat <- rep_seq_stat %>%
  select(c(ID,read_preprocessing.raw_reads, assignment_filtering.total, assignment_filtering.has_vj_assignment, 17:21))

write.table(rep_seq_stat, "../data/processed_data/summary_sequencing_table.tsv", row.names = FALSE, sep = "\t", quote = FALSE )

6 Comparing repertoires

6.1 Processing clonotype datasets

After merging the antigen-specific sequences and bulk sequencing, we ran the clonotype module from IgDiscover to assign a clonotype group to each sequence. We did that for both the individualized germline database and the KIMDB database. Here we read those datasets and wrangle the data into the required format.

ls <- list.files("../data/clonotypes", recursive = T, full.names = T)
ls <- ls[grepl("rsv", ls) & grepl("individualized|nestor-rm", ls) ]
names(ls) <- basename(dirname(ls))
rds_merge <- lapply(ls, readRDS)
rds_merge <- rbindlist(rds_merge, idcol = TRUE, fill = TRUE)
rds_merge <- rds_merge %>%
  select(.id, specificity_group, sc_clone_grp, grp, new_name, ID_timepoint, V_SHM, V_errors, CDR3_aa, cdr3_aa, V_gene, J_gene, v_call, j_call) %>%
  mutate(
    Timepoint = factor(gsub(".*_", "", ID_timepoint), levels = c("PV", "B1", "B2", "Single-cell")),
    ID = gsub("_.*", "", ID_timepoint),
    database = plyr::mapvalues(.id,
      from = c("nestor-rm", "individualized"),
      to = c("KIMDB", "Individualized")
    ),
    database = factor(database, levels = c("KIMDB", "Individualized")),
    Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ),
    cdr3_aa_length = ifelse(is.na(CDR3_aa), nchar(cdr3_aa), nchar(CDR3_aa)),
    v_call = ifelse(is.na(v_call), V_gene, v_call),
    j_call = ifelse(is.na(j_call), J_gene, j_call)
  ) %>%
  group_by(.id, ID_timepoint) %>%
  distinct(new_name, .keep_all = TRUE) %>%
  ungroup() %>%
  filter(database %in% c("KIMDB", "Individualized"))

6.2 Plot number of sequences per animal

Plotting the sequencing depth for each timepoint per animal. This shows that the sequencing depth for Boost 1 was lower; for that reason, all analyses were normalized by sequencing depth to take this into account.

# samples from the biological replicates after boost 2 (BR2-B2) from E17 were merged with the technical replicates after boost 2 (TR1-B2)
# this was done due to low sequencing depth for that animal and large expansion of uncharacterized clones 
full_rep_indiv <- read.table("../data/processed_data/summary_sequencing_table.tsv", 
                             header = TRUE) %>%
  separate(ID, sep = "_", into = c("sample", "ID")) %>%
  filter(sample != "TR2-B2") %>%
  mutate(sample = ifelse(sample == "BR2-B2", "TR1-B2", sample),
         timepoint = case_when(sample == "igm" ~ "PV",
                            grepl("B1", sample) ~ "B1",
                            grepl("B2", sample) ~ "B2"),
         timepoint = factor(timepoint, levels = c("PV", "B1", "B2"))) %>%
  group_by(timepoint, ID) %>%
  summarise(across(where(is.numeric), sum)) %>%
  ungroup()
  
# since here we wanted to compare all groups, we used Dunn's test
stat.test <- full_rep_indiv %>%
              dunn_test(formula = assignment_filtering.has_cdr3 ~ timepoint,
                          p.adjust.method = "fdr") %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
assignment_filtering.has_cdr3 PV B1 9 9 -3.6525703 0.0002596 0.0007789 *** 1393265 PV, B1 1 2
assignment_filtering.has_cdr3 PV B2 9 9 -0.2672612 0.7892680 0.7892680 ns 1491771 PV, B2 1 3
assignment_filtering.has_cdr3 B1 B2 9 9 3.3853091 0.0007110 0.0010665 ** 1590276 B1, B2 2 3
full_rep_indiv %>%
  ggdotplot(y = "assignment_filtering.has_cdr3", x = "timepoint", 
            group = "timepoint", 
            fill = "ID", 
            size = 1) +
  geom_boxplot(alpha = .2, outlier.shape = NA) +
  ggpubr::stat_compare_means(method = "kruskal") +
  labs(x = "Timepoint", y = "# good quality aligned sequences") +
  scale_fill_viridis_d() +
  stat_pvalue_manual(stat.test %>% mutate(p.adj = round(p.adj, 4)), label = "p.adj") +
   ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = 1, axis_text_angle = 45, base_rect_size = 1.5)

write.csv(full_rep_indiv, paste0("../", result.dir, "repertoire_depth.csv"), row.names = F)
ggsave(paste0("../", result.dir, "repertoire_depth.pdf"), width = 5, height = 3)

6.3 Generate data for clone size per animal per database

rds_summary <- rds_merge %>%
  group_by(database, ID_timepoint, grp) %>%
  summarise(
    ID = first(ID), Timepoint = first(Timepoint), Group = first(Group),
    clonal_size = n(),
    database,
    sc_clone_grp = first(sc_clone_grp),
    V_errors = mean(V_errors),
    specificity_group = first(specificity_group),
    cdr3_aa_length = mean(cdr3_aa_length),
    v_call = first(v_call),
    j_call = first(j_call)
  ) %>%
  ungroup() %>%
  group_by(database, ID_timepoint) %>%
  mutate(clonal_size_rank = dense_rank(dplyr::desc(clonal_size))) %>%
  ungroup() %>%
  distinct()

rds_summary_noPV <- rds_summary %>%
  filter(Timepoint != "PV", Timepoint != "Single-cell")

rds_summary_save <- rds_summary %>%
  filter(Timepoint %in% c("Single-cell", "B1","B2"),
         database == "Individualized") %>%
  group_by(ID_timepoint) %>%
  summarise(mean_clonal_size = mean(clonal_size),
            median_clonal_size = median(clonal_size),
            geom_clonal_size = exp(mean(log(clonal_size))),
            unique_clones = sum(clonal_size == 1),
            total_clones_detected = n(),
            percentage_unique_clones = round(unique_clones/total_clones_detected*100,digits = 2)) 

rds_summary_save %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
ID_timepoint mean_clonal_size median_clonal_size geom_clonal_size unique_clones total_clones_detected percentage_unique_clones
E11_B1 73.765766 24.0 23.618580 7 111 6.31
E11_B2 65.842324 18.0 16.139896 22 241 9.13
E11_Single-cell 1.596491 1.0 1.310902 169 228 74.12
E12_B1 96.802632 8.5 11.635387 5 76 6.58
E12_B2 62.696296 9.0 9.733452 18 135 13.33
E12_Single-cell 1.556000 1.0 1.304827 185 250 74.00
E14_B1 79.458333 9.0 9.683393 11 96 11.46
E14_B2 111.530612 10.5 12.335661 29 196 14.80
E14_Single-cell 2.041096 1.0 1.527244 139 219 63.47
E16_B1 73.535519 14.0 15.223779 14 183 7.65
E16_B2 118.491228 31.5 29.123260 17 228 7.46
E16_Single-cell 1.996094 1.0 1.489967 169 256 66.02
E17_B1 38.460526 8.0 9.753009 8 76 10.53
E17_B2 55.614035 6.0 7.902730 31 171 18.13
E17_Single-cell 1.665158 1.0 1.381309 151 221 68.33
E18_B1 41.323529 11.0 11.772174 9 102 8.82
E18_B2 72.662921 21.0 15.643111 23 178 12.92
E18_Single-cell 1.641026 1.0 1.368241 135 195 69.23
E21_B1 35.201835 5.0 6.887328 21 109 19.27
E21_B2 45.202312 14.0 13.455144 16 173 9.25
E21_Single-cell 1.674208 1.0 1.385378 151 221 68.33
E23_B1 59.675497 9.0 11.417886 19 151 12.58
E23_B2 63.044025 13.0 14.180265 18 159 11.32
E23_Single-cell 1.916279 1.0 1.484490 136 215 63.26
E24_B1 55.469231 10.0 11.115858 18 130 13.85
E24_B2 62.310735 9.0 10.429075 29 177 16.38
E24_Single-cell 1.829897 1.0 1.400170 137 194 70.62
write.csv(rds_summary_save, 
          paste0("../", result.dir, "clone_size_mean_median.csv"),row.names = FALSE)

6.4 Process and merge dataset with metadata

rds_indiv <- rds_summary %>%
  filter(
    database == "Individualized",
    Timepoint != "Single-cell",
    Timepoint != "PV"
  ) %>%
  mutate(
    LOR = ifelse(grepl(pattern = paste0(lor_mabs$well_ID, collapse = "|"), x = sc_clone_grp), "cloned", "not_cloned"),
    LOR = factor(LOR, levels = c("cloned", "not_cloned"), ordered = TRUE)
  )
rds_indiv$Timepoint <- droplevels(rds_indiv$Timepoint)
all <- rds_indiv %>%
  tidyr::expand(ID, Timepoint, grp) %>%
  filter(Timepoint != "Single-cell")

rds_indiv <- rds_indiv %>%
  right_join(all) %>%
  mutate(
    ID_timepoint = paste(ID, Timepoint, sep = "_"),
    database = "individualized",
    clonal_size = ifelse(is.na(clonal_size), 0, clonal_size)
  ) %>%
  arrange(grp, ID_timepoint) %>%
  tidyr::fill(LOR, Group, sc_clone_grp)

6.5 Cumulative area clone size plots

The code below generates cumulative area plots for the clone sizes of the 20-mer and 1-mer groups, further divided by specificity group (DP & PreF). The code starts by summarizing the clone sizes for each combination of “Group”, “Timepoint”, and “specificity_group” from the “rds_indiv” data. The specificity group “PostF” is filtered out for this plot. The data is then plotted with ggplot2 as area plots, with the x-axis representing “Timepoint”, the y-axis representing the cumulative “Clonal size”, and the area filled by “specificity_group” (DP & PreF). Each “Group” is shown as a facet, displaying the cumulative growth of clone sizes over the timepoints. The area plots offer a visual representation of how clone sizes evolve over time, highlighting differences between the two specificity groups (DP & PreF) for both the 20-mer and 1-mer groups.

6.5.1 Clone size area plot (cumulative) of 20-mer and 1-mer, divided by specificity (DP & PreF)

rds_indiv %>%
  group_by(Group, Timepoint, specificity_group) %>%
  summarise(clonal_size = sum(clonal_size)) %>%
  tidyr::drop_na() %>%
  filter(specificity_group != "PostF") %>%
  ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
  geom_area(color = "black") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 80000)) +
  scale_x_discrete(expand = c(0.02, 0.02)) +
  scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
  labs(y = "Clonal size") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~Group)

ggsave(paste0("../", result.dir, "rsv_specific_clonal_size_area_plot_spec.pdf"), width = 8, height = 2)

6.5.2 Clone size area plot (cumulative) per animal, divided by specificity (DP & PreF)

rds_indiv %>%
  group_by(Group, ID, Timepoint, specificity_group) %>%
  summarise(clonal_size = sum(clonal_size)) %>%
  tidyr::drop_na() %>%
  filter(specificity_group != "PostF") %>%
  ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
  geom_area(color = "black") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 30000)) +
  scale_x_discrete(expand = c(0.02, 0.02)) +
  scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
  labs(y = "Clonal size") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~Group + ID, ncol = 5)

6.6 Correlation clone size per animal per database

The provided code plots the clonal sizes from two databases, “KIMDB” and “Individualized”, for different timepoints. It creates scatter plots with clonal sizes from the “Individualized” database on the x-axis and “KIMDB” database on the y-axis, allowing visual comparison and assessment of correlation. The dashed line with a slope of 1 and an intercept of 0 represents the ideal correlation line. The plots are grouped by timepoints, providing a concise visual understanding of the clonal size relationship between the two databases over time.

df_all <- data.frame()
for (timepoints in unique(rds_summary_noPV$Timepoint)) {
  for (databases in unique(rds_summary_noPV$database)) {
    rds_timepoint <- rds_summary_noPV %>%
      filter(Timepoint == timepoints, database == databases) %>%
      arrange(desc(clonal_size)) %>%
      pull(clonal_size)

    rds_indiv_corr <- rds_summary_noPV %>%
      filter(
        database == "Individualized",
        Timepoint == timepoints
      ) %>%
      arrange(desc(clonal_size)) %>%
      pull(clonal_size)

    length(rds_indiv_corr) <- length(rds_timepoint)

    df <- data.frame(
      size_indiv = rds_indiv_corr,
      clonal_size = rds_timepoint,
      Timepoint = timepoints,
      database = databases
    )
    df[is.na(df)] <- 0
    df_all <- rbind(df, df_all)
  }
}


df_all %>%
  filter(database != "Individualized") %>%
  mutate(database = factor(database, levels = c("KIMDB", "Individualized"))) %>%
  ggplot(aes(x = size_indiv, y = clonal_size)) +
  geom_point(size = 1) +
  geom_abline(intercept = 0, slope = 1, color = "black", linetype = "dashed") +
  scale_x_continuous(
    trans = pseudo_log_trans(base = 10),
    breaks = c(1, 10, 100, 1000, 10000),
    labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
  ) +
  scale_y_continuous(
    trans = pseudo_log_trans(base = 10),
    breaks = c(1, 10, 100, 1000, 10000),
    labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
  ) +
  labs(y = "Clonal size\n (KIMDB)", x = "Clonal size\n(Individualized database)") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~ Timepoint)

ggsave(paste0("../", result.dir, "individualized_db_comparison.pdf"), width = 12, height = 6)
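A note on the axis transformation used above: the clone sizes padded to 0 in the loop cannot be drawn on a plain log10 scale, which is why the plot uses scales::pseudo_log_trans(), a transformation that is finite at zero. A quick comparison with illustrative values:

# pseudo-log handles zeros, unlike log10
scales::pseudo_log_trans(base = 10)$transform(c(0, 1, 10, 100, 1000))
log10(c(0, 1, 10, 100, 1000))  # returns -Inf for 0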

6.7 IGHV-J pairing per database

The presented R code analyzes IGHV-J pairing for the two databases, “Individualized” and “KIMDB”. The code preprocesses data from the “rds_summary” dataframe, extracting V and J gene calls and calculating clonal size percentages for each V-J combination. It then creates a heatmap plot (plot_vj) with IGHV alleles on the x-axis and IGHJ alleles on the y-axis, using the fill color to represent the mean clonal size percentage for each V-J pairing. The heatmap is faceted by specificity group (“Total”, “PreF”, and “DP”) and vaccine group (“20-mer” and “1-mer”) for each database, providing a compact visual comparison of the IGHV-J pairing distributions between the two databases.

for(i in c("Individualized", "KIMDB")){
rds_summary_noPV <- rds_summary %>%
  # check if we want to filter clonotypes by size or not
  filter(database == i) %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  mutate(v_call = sub("\\*.*","", v_call),
         j_call = sub("\\*.*|-.*","", j_call),
         clonal_size_log = log10(clonal_size),
         Group = factor(Group, levels = c("20-mer", "1-mer")),
         specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP", "PostF"))) %>%
  group_by(ID) %>%
  mutate(clonal_size_perc = (clonal_size/sum(clonal_size)) * 100) %>%
  ungroup() %>% 
  group_by(Group, specificity_group, v_call, j_call) %>%
  summarise(clonal_size_mean_perc = mean(clonal_size_perc), clonal_size_mean = mean(clonal_size)) %>%
  filter(specificity_group != "PostF") %>% droplevels()

plot_vj <-  rds_summary_noPV %>%
    ggplot(aes(x= v_call, y = j_call, fill = clonal_size_mean_perc)) +
      geom_tile(color = "black") +
      scale_fill_viridis_c(option = "viridis", direction = 1, limits = c(0, 1.5), breaks = c(0,0.5, 1, 1.5)) +
      scale_y_discrete(position = "right") +
      ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
      coord_equal() +
      theme(axis.text.x = element_text(angle = 90, size = 5, hjust = 1, vjust = 0.5, face = "bold", colour = "black"),
            axis.text.y = element_text(face = "bold", colour = "black", size = 5),
            strip.text = element_text(face = "bold", size = 7),
            panel.grid.major = element_blank(), 
            panel.grid.minor = element_blank(),
            legend.position =  "top",
            axis.ticks = element_line(size = .5),
            legend.title = element_text()) +
      labs(fill = "Sequence count\n(% RSV repertoire\n per group)", x = "IGHV alleles", y = "IGHJ alleles", title = paste0(i, " Database")) +
      facet_wrap(~ specificity_group + Group, ncol = 1, strip.position = "left")

print(plot_vj)

ggsave(plot = plot_vj, filename = paste0("../",result.dir,i,"_IGHV-IGHJ_pairing-rsv-specific-sequences_perc.pdf"), width = 9, height = 7)
}

6.8 Count unique V-J pairs

The provided R code counts and compares unique V-J pairing occurrences in the “Individualized” database. It preprocesses the data, calculates the number of unique V-J pairs for each group, and performs statistical tests. The results are visualized using a dot plot with mean values and p-value labels, enabling easy comparison of unique V-J pairing counts among different groups.

# groups to perform statistical comparisons
my_comparisons <- list(c("Total_B1_20-mer", "Total_B1_1-mer"),
                       c("Total_B2_20-mer", "Total_B2_1-mer"),
                       c("PreF_B1_20-mer", "PreF_B1_1-mer"),
                       c("PreF_B2_20-mer", "PreF_B2_1-mer"),
                       c("DP_B1_20-mer", "DP_B1_1-mer"),
                       c("DP_B2_20-mer", "DP_B2_1-mer"))

rds_summary_noPV <- rds_summary %>%
  # optionally filter clonotypes by clonal size here (e.g., remove singletons)
  filter(database == "Individualized", Timepoint != "Single-cell", Timepoint != "PV") %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  filter(specificity_group != "PostF") %>%
  mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
         Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  group_by(ID_timepoint, specificity_group, Group_specificity, Group) %>%
  mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
  distinct(v_j_calls) %>%
  summarise(unique_v_j = n()) %>%
  ungroup()

rds_summary_noPV %>% write.csv(paste0("../", result.dir,"unique_HV_HJ_pairing-data.csv"), row.names = F)

stat.test <- rds_summary_noPV %>%
              wilcox_test(formula = unique_v_j ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
unique_v_j Total_B1_20-mer Total_B1_1-mer 5 4 14 0.413 0.496 ns 137.040 Total_B1…. 1 2
unique_v_j Total_B2_20-mer Total_B2_1-mer 5 4 11 0.901 0.901 ns 150.288 Total_B2…. 3 4
unique_v_j PreF_B1_20-mer PreF_B1_1-mer 5 4 4 0.176 0.264 ns 163.536 PreF_B1_…. 5 6
unique_v_j PreF_B2_20-mer PreF_B2_1-mer 5 4 0 0.016 0.048 * 176.784 PreF_B2_…. 7 8
unique_v_j DP_B1_20-mer DP_B1_1-mer 5 4 20 0.016 0.048 * 190.032 DP_B1_20…. 9 10
unique_v_j DP_B2_20-mer DP_B2_1-mer 5 4 18 0.064 0.127 ns 203.280 DP_B2_20…. 11 12
rds_summary_noPV %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "unique_v_j", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 1) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 200)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "HV and HJ unique pairs\n(Count)", x = "") +
  stat_summary(fun.y = mean, 
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 150) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "v-j-pair_count.pdf"), width = 5, height = 3)

6.9 VennDiagram of HV-HJ sharing

The provided R code generates a Venn diagram to visualize the sharing of HV-HJ combinations between the “20-mer” and “1-mer” groups in the “Individualized” database. The code preprocesses the data from the “rds_summary” dataframe, filtering out the “PostF” specificity and organizing the data by “Group” (vaccine group). It then creates two lists, one for the “20-mer” and another for the “1-mer” group, containing the distinct HV-HJ combinations for each. The Venn diagram is plotted using ggVennDiagram, representing the shared and unique HV-HJ combinations between the two groups. The diagram provides a concise visual understanding of the overlap and differences in HV-HJ pairing between clonotypes from the “20-mer” and “1-mer” animals in the “Individualized” database.

rds_summary_noPV <- rds_summary %>%
  # decide if clonal size greater than 1 should be included
  filter(database == "Individualized", Timepoint != "Single-cell") %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  filter(specificity_group != "PostF") %>%
  mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
         Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  group_by(Group) %>%
  mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
  distinct(v_j_calls) 

list_v_j <- list("20-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "20-mer"],
                 "1-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "1-mer"])

ggVennDiagram::ggVennDiagram(list_v_j, label_size = 7, 
                             label_alpha = 0,
                             set_color = "black",
                             set_size = 9) +
  scale_fill_gradient(low = "#F4FAFE", high = "#4981BF") +
  scale_color_manual(values = c("#5F90B0", "#D896C1")) +
  theme(legend.position = "none")

ggsave(paste0("../", result.dir,"shared-unique-v-j_pairing.pdf"), height = 5, width = 5)

7 Generate processed files for diversity index estimation

The lines that save intermediate files are commented out, so not all of them will be written; one could uncomment them if all the intermediate files are needed. The intermediate files were used to run Recon (v2.5) (Kaplinsky & Arnaout, Nat Commun, 2016) with default parameters; more about running Recon is described in the section Recon estimates for RSV-specific diversity below.

# add column for loop of ID, timepoint and specificity
clonotype_rsv <- clonotype_rsv %>%
  mutate(ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group))

# create empty lists for loop
filtered_animal_rsv <- list()
clonotype_summary_rsv <- list()

# loop to create summary and full clonotype files for saving and/or following analysis
for (animals in unique(clonotype_rsv$ID_timepoint_spec)) {
  # save full clonotype files
  filtered_animal_rsv[[animals]] <- clonotype_rsv %>%
    filter(ID_timepoint_spec == animals) %>%
    as.data.frame()
 # write.csv(filtered_animal_rsv[[animals]], paste0("../", result.dir, animals, "_clonotype_full.csv"), row.names = FALSE)

  # save summary files
  clonotype_summary_rsv[[animals]] <- filtered_animal_rsv[[animals]] %>%
    group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
    summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
    arrange(desc(clonal_size)) %>%
    ungroup()
  #write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
  }

# doing the same thing but for the total repertoire, i.e. not accounting for the specificities (PreF, DP, PostF)
for (animals in unique(clonotype_rsv$ID_timepoint)) {
  # save summary files
  clonotype_summary_rsv[[paste0(animals, "_total")]] <- clonotype_rsv %>%
    filter(ID_timepoint == animals) %>%
    as.data.frame() %>%
    mutate(
      specificity_group = "Total",
      ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group)
    ) %>%
    group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
    summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
    arrange(desc(clonal_size)) %>%
    ungroup()
 # write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
  }


# Save clonotype summary per animal, taking into account ID, timepoint and clonal group
filtered_animal_rsv_summary <- list()

for (animals in unique(clonotype_rsv$ID)) {
  # save summary files
  filtered_animal_rsv_summary[[animals]] <- clonotype_rsv %>%
    filter(ID == animals) %>%
    as.data.frame() %>%
    group_by(grp, timepoint, ID) %>%
    summarise(clonal_size = n(), single_cells = first(sc_clone_grp), v_call = first(v_call), j_call = first(j_call), cdr3_aa = first(cdr3_aa)) %>%
    arrange(desc(clonal_size)) %>%
    ungroup() 
   # write.csv(filtered_animal_rsv_summary[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
    }

to_recon_rsv <- data.table::rbindlist(clonotype_summary_rsv) %>%
  select(clonal_size, ID_timepoint_spec) %>%
  group_by(clonal_size, ID_timepoint_spec) %>%
  summarise(size = n()) %>%
  ungroup()


for (i in unique(to_recon_rsv$ID_timepoint_spec)) {
  to_recon_table <- to_recon_rsv %>%
    filter(ID_timepoint_spec == i) %>%
    select(clonal_size, size)

 # fwrite(to_recon_table, file = paste0("../", result.dir, i, "_rsv_file_to_recon.txt"), append = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
}
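For reference, each file written for Recon above is a two-column, tab-separated table without a header: the clone size and the number of clones observed at that size. A minimal sketch with illustrative values (not taken from the data):

# illustrative example of the Recon input produced by the (commented-out) fwrite() call above
# column 1 = clonal_size, column 2 = number of clones of that size; tab-separated, no header
example_to_recon <- data.frame(clonal_size = c(1, 2, 3, 10),
                               size = c(45, 12, 6, 1))
example_to_recon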

7.1 Calculating species richness

7.1.1 Chao1 estimates for RSV-specific diversity

Species richness (Chao1) was calculated using the vegan R package. The samples were first subsampled 100 times for each animal/timepoint and the species richness was then estimated. The mean of those 100 replicates was used for plotting.

Here is the processing and estimation of species richness:

rsv_chao <- lapply(clonotype_summary_rsv, function(x) select(x, "clonal_size"))


# subsample and replicate the subsampling 100 times for higher accuracy

chaox100 <- function(x, value_to_subsample) {
  replicate(100, {
    subsample <- vegan::rrarefy(x, value_to_subsample)
    chao <- vegan::estimateR(subsample)
    return(chao)
  })
}

diff_spec_timepoints <- unique(substring(names(clonotype_summary_rsv), 5))

# subsample based on lowest total clonotype size per group
chao_list_results <- list()
for (spec_timepoint in diff_spec_timepoints) {
  print(spec_timepoint)
  rsv_filtered <- rsv_chao[grepl(spec_timepoint, names(rsv_chao))]
  min_to_subsample <- min(unlist(lapply(rsv_filtered, colSums)))
  chao_list_results[[spec_timepoint]] <- lapply(rsv_filtered, chaox100, min_to_subsample)
}
## [1] "Single-cell_DP"
## [1] "B1_DP"
## [1] "B2_DP"
## [1] "Single-cell_PreF"
## [1] "B1_PreF"
## [1] "B2_PreF"
## [1] "Single-cell_PostF"
## [1] "B1_PostF"
## [1] "B2_PostF"
## [1] "Single-cell_total"
## [1] "B1_total"
## [1] "B2_total"
change_names <- function(x) {
  names(x) <- gsub("_.*", "", names(x))
  x
}

# adjusted dataset for plotting
{
  chao_results_df <- purrr::map(chao_list_results, ~ change_names(.x))
  chao_results_df <- rbindlist(chao_results_df, use.names = TRUE, idcol = TRUE, fill = TRUE)
  chao_results_df$algorithm <- rep(c("Obs", "Chao1", "Chao1_se", "ACE", "ACE_se"), nrow(chao_results_df) / 5)
  # save intermediate file
  chao_results_df %>%
    filter(algorithm %in% c("Chao1", "ACE")) %>%
    dplyr::rename(Timepoint_specificity = .id) %>%
    write.csv(paste0("../", result.dir, "rsv_repertoire_diversity.csv"), row.names = FALSE)
  # diversity mean of x100 replicates per animal
  chao_results_df %>%
    filter(algorithm %in% c("Chao1", "ACE")) %>%
    dplyr::rename(Timepoint_specificity = .id) %>%
    group_by(algorithm, Timepoint_specificity) %>%
    summarise_all(.funs = mean) %>%
    write.csv(paste0("../", result.dir, "rsv_repertoire_diversity_mean.csv"), row.names = FALSE)

  chao_results_df <- tidyr::pivot_longer(chao_results_df, cols = 2:(length(chao_results_df) - 1), names_to = c("ID")) %>%
    tidyr::separate(.id, c("Timepoint", "Specificity"), sep = "_") %>%
    mutate(Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ))
}
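For context on the algorithm labels assigned above: vegan::estimateR() returns five statistics per sample in a fixed order (observed richness, Chao1 and its standard error, ACE and its standard error), which is what the rep() call relies on. A minimal check on made-up counts:

# made-up clone-size counts, only to show the order of the estimateR() output
vegan::estimateR(c(25, 10, 6, 3, 1, 1, 1, 1))
# returns S.obs, S.chao1, se.chao1, S.ACE, se.ACE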

7.1.2 Chao1 plotting between groups for RSV-specific diversity

Here is the plot and comparison between the vaccinated groups:

chao_results_df$Specificity[chao_results_df$Specificity == "total"] <- "Total"

chao_results_df <- chao_results_df %>%
  filter(algorithm != "Chao1_se" & algorithm != "ACE_se") %>%
  filter(algorithm == "Chao1", Timepoint != "PV", Timepoint != "Single-cell") %>%
  mutate(Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ),
    Group_specificity = paste(Specificity, Timepoint, Group, sep = "_"),
    Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  group_by(ID, Group, Timepoint, Specificity, algorithm, Group_specificity) %>%
  summarise(value = mean(value)) %>%
  tidyr::drop_na() %>%
  ungroup()

write.csv(chao_results_df, paste0("../", result.dir, "chao1_results_plot_values.csv"), row.names = FALSE)


stat.test <- chao_results_df %>%
              wilcox_test(formula = value ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
value Total_B1_20-mer Total_B1_1-mer 5 4 16 0.190 0.285 ns 272.3352 Total_B1…. 1 2
value Total_B2_20-mer Total_B2_1-mer 5 4 15 0.286 0.343 ns 301.8278 Total_B2…. 3 4
value PreF_B1_20-mer PreF_B1_1-mer 5 4 7 0.556 0.556 ns 331.3205 PreF_B1_…. 5 6
value PreF_B2_20-mer PreF_B2_1-mer 5 4 2 0.064 0.127 ns 360.8131 PreF_B2_…. 7 8
value DP_B1_20-mer DP_B1_1-mer 5 4 20 0.016 0.048 * 390.3058 DP_B1_20…. 9 10
value DP_B2_20-mer DP_B2_1-mer 5 4 20 0.016 0.048 * 419.7984 DP_B2_20…. 11 12
chao_results_df %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "value", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 1) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "Species richness\n(Chao1)", x = "") +
  stat_summary(fun.y = mean, 
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "chao1_species_richness.pdf"), width = 5, height = 3)

7.1.3 Recon estimates for RSV-specific diversity

Following Recon's default settings and tutorial (see the Recon manual), the count files generated previously were used to create fit files, which were then used as input for generating the diversity_table.txt for all samples.

To generate the fit file for each count file, the following bash script was used in a for loop:

#!/bin/sh
set -euo pipefail
FILE=$1
python recon_v2.5.py -R -t 30 -c -o ${FILE}_fitfile.txt $FILE

python recon_v2.5.py -x --x_max 30 -o ${FILE}_plotfile.txt -b error_bar_parameters.txt ${FILE}_fitfile.txt

Then, each fit file was used for generating the diversity_table.txt with the command:

python recon_v2.5.py -v -D -b error_bar_parameters.txt -o output_diversity_table.txt *rsv_file_to_recon.txt_fitfile.txt

The results from diversity_table.txt for all the samples were used as input for plotting.

recon_res <- read.table("../data/diversity_index/recon/rsv_output_diversity_table.txt", header = TRUE) %>%
  filter(Timepoint != "PV", Timepoint != "Single-cell", Specificity != "PostF") %>%
  mutate(Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ),
   Group_specificity = paste(Specificity, Timepoint, Group, sep = "_")) %>%
  mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  ungroup()

write.csv(recon_res, paste0("../", result.dir, "recon_results_plot_values.csv"), row.names = FALSE)
  

stat.test <- recon_res %>%
              wilcox_test(formula = est_0.0D ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
est_0.0D Total_B1_20-mer Total_B1_1-mer 5 4 16 0.190 0.228 ns 275.7082 Total_B1…. 1 2
est_0.0D Total_B2_20-mer Total_B2_1-mer 5 4 16 0.190 0.228 ns 304.8120 Total_B2…. 3 4
est_0.0D PreF_B1_20-mer PreF_B1_1-mer 5 4 7 0.556 0.556 ns 333.9159 PreF_B1_…. 5 6
est_0.0D PreF_B2_20-mer PreF_B2_1-mer 5 4 2 0.064 0.127 ns 363.0197 PreF_B2_…. 7 8
est_0.0D DP_B1_20-mer DP_B1_1-mer 5 4 20 0.016 0.048 * 392.1236 DP_B1_20…. 9 10
est_0.0D DP_B2_20-mer DP_B2_1-mer 5 4 20 0.016 0.048 * 421.2274 DP_B2_20…. 11 12
recon_res %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "est_0.0D", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 1) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "Species richness\n(Recon)", x = "") +
  stat_summary(fun.y = mean, 
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "recon_species_richness.pdf"), width = 5, height = 3)

7.2 Somatic hypermutation comparisons

7.2.1 Somatic hypermutation over time for individualized database

Here the SHM is shown for all the sequences of every animal combined. The data is divided by 20-mer, 1-mer, and the specificity groups.

rds_indiv_total <- rds_merge %>%
  mutate(specificity_group = "Total") 

rds_indiv_plot <- rbind(rds_merge, rds_indiv_total)


rds_indiv_plot <- rds_indiv_plot %>%
  filter(specificity_group != "PostF") %>%
  mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_")) %>%
  mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>% ungroup() %>% drop_na(Group_specificity)

stat.test <- rds_indiv_plot %>%
              wilcox_test(formula = V_errors ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
V_errors Total_B1_20-mer Total_B1_1-mer 71438 41890 1096663994 0 0 **** 140.760 Total_B1…. 1 2
V_errors Total_B2_20-mer Total_B2_1-mer 141383 92712 4905379554 0 0 **** 151.272 Total_B2…. 3 4
V_errors PreF_B1_20-mer PreF_B1_1-mer 20774 21126 196698706 0 0 **** 161.784 PreF_B1_…. 5 6
V_errors PreF_B2_20-mer PreF_B2_1-mer 48561 47362 964010280 0 0 **** 172.296 PreF_B2_…. 7 8
V_errors DP_B1_20-mer DP_B1_1-mer 38256 8444 139047744 0 0 **** 182.808 DP_B1_20…. 9 10
V_errors DP_B2_20-mer DP_B2_1-mer 82239 33204 1055796386 0 0 **** 193.320 DP_B2_20…. 11 12
counts <- rds_indiv_plot %>% group_by(Group, specificity_group) %>% summarise(counts = n())

rds_indiv_plot %>%
  ggpubr::ggviolin(x = "Group_specificity", y = "V_errors", fill = "Group_specificity", group = "Group_specificity", 
                    legend = "none") +
  geom_boxplot(outlier.shape = NA, width = 0.15, color = "black", alpha = 0.2)+
  geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 80)) +
  scale_shape_manual(values=NA)+
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
  labs(y = "# IGHV nucleotide mutations",  x= "") +
#  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 70) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none") 

ggsave(paste0("../", result.dir, "rsv_specific_SHM_per_group.pdf"), width = 6, height = 3)

7.2.2 Plot somatic hypermutation over time summarized by animal

Here the SHM data is summarized per macaque, so each dot represents the average SHM of all the antigen-specific sequences for one animal.

rds_indiv_plot_summ <- rds_indiv_plot %>%
  filter(Group_specificity != "PostF") %>%
  group_by(ID, Group_specificity) %>%
  summarise(avg_V_errors = mean(V_errors, na.rm = TRUE),
            sd_V_errors = sd(V_errors, na.rm = TRUE),
            n = n()) %>%
  ungroup() %>%
  tidyr::drop_na()

stat.test <- rds_indiv_plot_summ %>%
              wilcox_test(formula = avg_V_errors ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
avg_V_errors Total_B1_20-mer Total_B1_1-mer 5 4 5 0.286 0.826 ns 25.37176 Total_B1…. 1 2
avg_V_errors Total_B2_20-mer Total_B2_1-mer 5 4 6 0.413 0.826 ns 26.81147 Total_B2…. 3 4
avg_V_errors PreF_B1_20-mer PreF_B1_1-mer 5 4 8 0.730 0.876 ns 28.25118 PreF_B1_…. 5 6
avg_V_errors PreF_B2_20-mer PreF_B2_1-mer 5 4 8 0.730 0.876 ns 29.69090 PreF_B2_…. 7 8
avg_V_errors DP_B1_20-mer DP_B1_1-mer 5 4 4 0.190 0.826 ns 31.13061 DP_B1_20…. 9 10
avg_V_errors DP_B2_20-mer DP_B2_1-mer 5 4 10 1.000 1.000 ns 32.57032 DP_B2_20…. 11 12
counts <- rds_indiv_plot_summ %>% group_by(Group_specificity) %>% summarise(max_n = max(n), min_n = min(n), total_n = sum(n))

rds_indiv_plot_summ %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "avg_V_errors", fill = "Group_specificity", group = "Group_specificity", 
                    legend = "none") +
  geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 25)) +
  scale_shape_manual(values=NA)+
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
  stat_summary(fun = mean, geom = "crossbar") +
  labs(y = "# IGHV nucleotide mutations",  x= "") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 22) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none") 

write.csv(rds_indiv_plot_summ, paste0("../", result.dir, "rsv_specific_mean_SHM_per_animal.csv"), row.names = FALSE)

ggsave(paste0("../", result.dir, "rsv_specific_mean_SHM_per_animal.pdf"), width = 6, height = 3)

7.3 Plot HCDR3 length over time for individualized database divided by 20-mer and 1-mer, and specificities

rds_indiv_hcdr3 <- rds_indiv %>%
 # filter(database == "Individualized", Timepoint == c("B1", "B2")) %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  filter(specificity_group != "PostF") %>%
  mutate(
    Group_specificity = paste(Group, specificity_group, sep = "_"),
    specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP")),
    Timepoint = factor(Timepoint, levels = c("B1", "B2"))
  ) %>% drop_na()
  
counts <- rds_indiv_hcdr3 %>% group_by(specificity_group, Timepoint, Group) %>% summarise(N=paste0('n = ',n()))
counts
## # A tibble: 12 × 4
## # Groups:   specificity_group, Timepoint [6]
##    specificity_group Timepoint Group  N      
##    <fct>             <fct>     <chr>  <chr>  
##  1 Total             B1        1-mer  n = 383
##  2 Total             B1        20-mer n = 651
##  3 Total             B2        1-mer  n = 682
##  4 Total             B2        20-mer n = 976
##  5 PreF              B1        1-mer  n = 205
##  6 PreF              B1        20-mer n = 205
##  7 PreF              B2        1-mer  n = 389
##  8 PreF              B2        20-mer n = 346
##  9 DP                B1        1-mer  n = 144
## 10 DP                B1        20-mer n = 413
## 11 DP                B2        1-mer  n = 250
## 12 DP                B2        20-mer n = 583
rds_indiv_hcdr3 %>%
  ggplot(aes(x = cdr3_aa_length, fill = Group)) +
  geom_density(alpha = .7) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_manual(values = c("#5F90B0", "#D896C1")) +
  labs(y = "Density", x = "HCDR3 aa length") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
 # geom_text(data = counts, aes(label = N, fill = NA), x = Inf, y = Inf, hjust = 1, vjust = 1, size = 4, color = "black") +
  facet_wrap(~ specificity_group + Timepoint, ncol = 3, dir = "v") 

ggsave(paste0("../", result.dir, "rsv_specific_CDR3_length.pdf"), width = 6, height = 3)

8 Comparison of discovered alleles (individualized database)

8.1 Plot V alleles unique and shared per animal, stacked plot

ls <- list.files("../data/databases/individualized", recursive = T, full.names = T, pattern = "V.fasta")
names(ls) <- basename(dirname(ls))
ls <- ls[-1] # remove combined V genes
individualized_dbs <- lapply(ls, Biostrings::readDNAStringSet)
size_indv_db <- data.frame(
  v_count = unlist(lapply(individualized_dbs, length)),
  type = "Shared"
) %>%
  tibble::rownames_to_column("ID")
individualized_dbs <- unlist(Biostrings::DNAStringSetList(individualized_dbs))
duplicated_names <- c(names(unique(individualized_dbs[length(individualized_dbs):1, ])), names(unique(individualized_dbs)))
individualized_dbs_sel <- individualized_dbs[setdiff(names(individualized_dbs), duplicated_names)]
size_unique_db <- data.frame(
  v_unique = table(substr(names(unique(individualized_dbs_sel)), 1, 3)),
  type = "Unique") %>%
  dplyr::rename(v_count = v_unique.Freq,
         ID = v_unique.Var1)


unique_v_indiv <- rbind(size_indv_db, size_unique_db) %>%
  add_row(ID = c("E11", "E24"), v_count = rep(0), type = rep("Unique")) %>%
  group_by(ID) %>%
  arrange(ID, .by_group = TRUE) %>%
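  # split each animal's total V-allele count into the "Unique" segment and the
  # remaining "Shared" segment so that the stacked bars sum to the total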
  mutate(
    diff = v_count - lag(v_count, default = last(v_count)),
    diff = ifelse(diff < 0, v_count, diff)
  )
unique_v_indiv %>%
  write.csv(paste0("../",result.dir, "alleles_unique_shared_per_animal.csv"), row.names = FALSE)

unique_v_indiv %>%
  mutate(type = factor(type, levels = c("Unique", "Shared"))) %>%
  ggplot(aes(x = ID, y = diff, fill = type)) +
  geom_col(color = "black") +
  scale_fill_viridis_d(direction = -1) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
  scale_x_discrete(expand = c(0, 0)) +
  labs(y = "V alelle counts", x = "Animal ID") +
  theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5))

ggsave(paste0("../", result.dir, "unique_and_shared_alleles.pdf"), width = 8, height = 6.38)

8.2 Plot V alleles validated or not per animal, stacked plot

kimdb <- Biostrings::readDNAStringSet("../data/databases/nestor_rm/V.fasta")

joined_dbs <- c(kimdb, individualized_dbs_sel)
uniq_joined_dbs <- unique(joined_dbs)
uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]
## DNAStringSet object of length 15:
##      width seq                                              names               
##  [1]   296 CAGGTCCAGCTGGTGCAATCCGG...CCGTGTATTACTGTGCAAGAGA E12.IGHV1-NL_1*01...
##  [2]   293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E12.IGHV3-100*01
##  [3]   293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
##  [4]   293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
##  [5]   298 GAAGTGCAGCTGGTGGAGTCTGG...TTGTATTACTGTAGTAGAGAGA E14.IGHV3-NL_15*0...
##  ...   ... ...
## [11]   296 GAGGTGCAGCTGGCGGAGTCTGG...CCGTGTATTACTGTGCGAGAGA E16.IGHV3-87*02
## [12]   299 CAGGTACAGCTGAAGGAGTCAGG...CCGTGTATTACTGTGCGAGACA E16.IGHV4-NL_23*0...
## [13]   296 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGAGA E18.IGHV4-NL_5*01...
## [14]   299 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGACA E18.IGHV4-NL_33*0...
## [15]   297 GAAGTGCAGCTGGTGGAGTCTGG...CTTGTATTACTGTGCAAAGATA E21.IGHV3-141*01
size_uniq_kimdb <- data.frame(
  v_unique = table(substr(names(uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]), 1, 3)),
  type = "Not validated"
) %>%
  dplyr::rename(
    v_count = v_unique.Freq,
    ID = v_unique.Var1
  )
size_indv_db$type <- "KIMDB"
kimdb_v_indiv <- rbind(size_indv_db, size_uniq_kimdb) %>%
  add_row(ID = c("E11", "E17", "E23", "E24"), v_count = rep(0), type = rep("Not validated")) %>%
  group_by(ID) %>%
  arrange(ID, .by_group = TRUE) %>%
  mutate(
    diff = v_count - lag(v_count, default = last(v_count)),
    diff = ifelse(diff < 0, v_count, diff)
  )

kimdb_v_indiv %>%
  write.csv(paste0("../",result.dir, "alleles_validated_KIMDB.csv"), row.names = FALSE)

kimdb_v_indiv %>%
  mutate(type = factor(type, levels = c("Not validated", "KIMDB"))) %>%
  ggplot(aes(x = ID, y = diff, fill = type)) +
  geom_col(color = "black") +
  scale_fill_viridis_d(direction = -1) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
  scale_x_discrete(expand = c(0, 0)) +
  labs(y = "V alelle counts", x = "Animal ID") +
  theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5))

ggsave(paste0("../", result.dir, "indiv_validated_KIMDB_alleles.pdf"), width = 8, height = 6.38)

9 Phylogenetic tree LOR21 and LOR24 monoclonal antibodies

9.1 Processing data

clonal_tree_data <- clonotype_rsv %>%
  ungroup() %>%
  distinct(new_name, .keep_all = TRUE) %>%
  group_by(specificity_group, sc_clone_grp, grp, ID_timepoint) %>%
  add_tally(name = "clonal_size") %>%
  ungroup()

mabs_of_interests <- c("LOR21" = "E16_05_A08", "LOR24" = "E16_05_H03", "LOR74" = "E11_05_B06")

# read data from heavy_chain
V_genes <- readDNAStringSet("../data/databases/individualized/combined/V.fasta")
J_genes <- readDNAStringSet("../data/databases/individualized/combined/J.fasta")
HC_all_filtered <- clonotype_rsv %>% filter(timepoint == "Single-cell")

# read data from light chain
LV_genes <- readDNAStringSet("../data/databases/cirelli_LC/V.fasta")
LJ_genes <- readDNAStringSet("../data/databases/cirelli_LC/J.fasta")

9.2 Generate fasta files for tree plotting

Select all the sequences that belong to the same clonotypes as the mAbs of interest, in this case LOR21, LOR24, and LOR74. The selected sequences are saved as fasta files for subsequent multiple sequence alignment and phylogenetic tree inference.

9.3 Heavy chain

sc_seq_count <- list()
for(i in seq_along(mabs_of_interests)){
  print(i)
  mab <- mabs_of_interests[i]
  
  if(any(stringr::str_count(names(mab), "LOR") <= 1)){
    mab <- mabs_of_interests[i]
    mab_name <- names(mab)
  }
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
    mab_name <- names(mab)}
  # create UCA heavy chain
  UCA_V <- V_genes[HC_all_filtered$V_gene[HC_all_filtered$name %in% mab]]
  UCA_J <- J_genes[HC_all_filtered$J_gene[HC_all_filtered$name %in% mab]]
  UCA <- DNAStringSet(paste0(UCA_V,UCA_J))
  names(UCA) <- paste0(mab_name,"_UCA")
  # select sequences part of the same clonotype
  group <- clonal_tree_data %>% filter(stringr::str_detect(sc_clone_grp, mab)) %>% select(grp) %>% unique() %>% pull()
  filtered <- clonal_tree_data %>%
    filter(grp == group) 
 
  # save csv with b cell lineage lineages info
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    write.csv(filtered, paste0("../",result.dir,mab_name[1],".csv"),
            row.names = FALSE)
  }else{
  write.csv(filtered, paste0("../",result.dir,mab_name,".csv"),
            row.names = FALSE)
  }
  # deduplicating sequences
  fasta <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
  
  fasta_lc <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
  
  #adding single cell sequences
  sc_seqs <- filtered[!duplicated(filtered[,c('sc_clone_grp','VDJ_nt')]),]
  
  #saving duplicated
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    sc_seq_count[[mab_name[1]]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }else{
    sc_seq_count[[mab_name]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }
    sc_seqs <- df_to_fasta(sequence_strings = sc_seqs$VDJ_nt, sequence_name = gsub(":", "_",sc_seqs$new_name), save_fasta = FALSE)
  
  fasta <- c(fasta, sc_seqs, UCA)
  fasta <- fasta[unique(names(fasta))]
  
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name[[1]], ".fasta"), append = FALSE, format = "fasta")
    }else{
    writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, ".fasta"), append = FALSE, format = "fasta") }
 
  
}
## [1] 1
## [1] 2
## [1] 3

9.4 Light chain

for(i in seq_along(mabs_of_interests)){
  print(i)
  mab <- mabs_of_interests[i]
  
  if(any(stringr::str_count(names(mab), "LOR") <= 1)){
    mab <- mabs_of_interests[i]
    mab_name <- names(mab)
  }
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
    mab_name <- names(mab)}
  # create UCA light chain
  UCA_LV <- LV_genes[clono_light_chains$v_call[clono_light_chains$name %in% mab]]
  UCA_LJ <- LJ_genes[clono_light_chains$j_call[clono_light_chains$name %in% mab]]
  UCA_LC <- DNAStringSet(paste0(UCA_LV,UCA_LJ))
  names(UCA_LC) <- paste0(mab_name,"_LC_UCA")
  # select light chain clonotypes
  group <- clono_light_chains %>% filter(stringr::str_detect(name, mab)) %>% select(grp) %>% unique() %>% pull()
  filtered <- clono_light_chains %>%
    filter(grp == group, substr(name,1,3) == substr(mab,1,3)) 
 
  # save csv with b cell lineage lineages info
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    write.csv(filtered, paste0("../",result.dir, mab_name[1],"_LC", ".csv"),
            row.names = FALSE)
  }else{
  write.csv(filtered, paste0("../",result.dir,mab_name,"_LC",".csv"),
            row.names = FALSE)
  }
  # save object as BStringset
  fasta <- df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE)
  
  # save duplicated values
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    sc_seq_count[[paste0(mab_name[1],"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }else{
    sc_seq_count[[paste0(mab_name,"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }
  
  fasta <- c(fasta, UCA_LC)
  fasta <- fasta[names(fasta)]
  
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    writeXStringSet(fasta, filepath = paste0("../",result.dir,  mab_name[[1]],"_LC", ".fasta"), append = FALSE, format = "fasta")
    }else{
    writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, "_LC", ".fasta"), append = FALSE, format = "fasta") }
}
## [1] 1
## [1] 2
## [1] 3

9.5 Run MUSCLE in bash

Run a multiple sequence alignment using MUSCLE (v 5.1) for all the fasta files generated throughout this analysis. Following that, run FastTree (v 2.1.11) on all the aligned sequences and save the tree output to be plotted with the following code.


source ~/.bash_profile
DIR_DATE=$(date +'%Y-%m-%d')
cd ../results/$DIR_DATE/    
for f in *.fasta; do muscle -align $f -output $f.aln; done

for f in *.aln; do fasttree -nt -gtr $f > $f.tre; done
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 217 seqs, avg length 364, max 370
## Input: 8 seqs, avg length 322, max 326
## Input: 118 seqs, avg length 370, max 370
## Input: 10 seqs, avg length 331, max 334
## Input: 71 seqs, avg length 370, max 370
## Input: 3 seqs, avg length 332, max 334
## 
## (per-alignment MUSCLE progress output omitted; six alignments completed)
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.06 seconds
## Refining topology: 31 rounds ME-NNIs, 2 rounds ME-SPRs, 16 rounds ML-NNIs
##       0.12 seconds: SPR round   1 of   2, 101 of 432 nodes
##       0.25 seconds: SPR round   1 of   2, 401 of 432 nodes
##       0.38 seconds: SPR round   2 of   2, 201 of 432 nodes
##       0.49 seconds: SPR round   2 of   2, 401 of 432 nodes
## Total branch-length 2.197 after 0.52 sec
##       0.64 seconds: ML NNI round 1 of 16, 101 of 215 splits, 27 changes (max delta 1.450)
## ML-NNI round 1: LogLk = -5720.973 NNIs 57 max delta 7.09 Time 0.74
##       0.85 seconds: Optimizing GTR model, step 2 of 12
##       0.95 seconds: Optimizing GTR model, step 3 of 12
##       1.08 seconds: Optimizing GTR model, step 5 of 12
##       1.21 seconds: Optimizing GTR model, step 8 of 12
##       1.32 seconds: Optimizing GTR model, step 10 of 12
## GTR Frequencies: 0.2176 0.2444 0.3116 0.2264
## GTR rates(ac ag at cg ct gt) 1.2448 1.4803 0.7188 0.5852 1.4823 1.0000
##       1.45 seconds: ML Lengths 1 of 215 splits
##       1.55 seconds: Site likelihoods with rate category 11 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.816 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
##       1.69 seconds: ML NNI round 2 of 16, 101 of 215 splits, 38 changes (max delta 0.001)
## ML-NNI round 2: LogLk = -5299.576 NNIs 67 max delta 0.00 Time 1.81
## Turning off heuristics for final round of ML NNIs (converged)
##       1.81 seconds: ML NNI round 3 of 16, 1 of 215 splits
##       1.92 seconds: ML NNI round 3 of 16, 101 of 215 splits, 40 changes (max delta 0.000)
##       2.03 seconds: ML NNI round 3 of 16, 201 of 215 splits, 57 changes (max delta 1.019)
## ML-NNI round 3: LogLk = -5298.548 NNIs 61 max delta 1.02 Time 2.06 (final)
## Optimize all lengths: LogLk = -5298.540 Time 2.12
##       2.26 seconds: ML split tests for    100 of    214 internal splits
##       2.41 seconds: ML split tests for    200 of    214 internal splits
## Total time: 2.44 seconds Unique: 217/217 Bad splits: 0/214
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 10 rounds ME-NNIs, 2 rounds ME-SPRs, 5 rounds ML-NNIs
## Total branch-length 0.080 after 0.00 sec
## ML-NNI round 1: LogLk = -612.010 NNIs 1 max delta 0.13 Time 0.00
## GTR Frequencies: 0.2670 0.2510 0.2459 0.2361
## GTR rates(ac ag at cg ct gt) 0.1273 14.1848 2.8987 3.9092 3.0704 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.642 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -587.982 NNIs 0 max delta 0.00 Time 0.02
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -587.981 NNIs 0 max delta 0.00 Time 0.02 (final)
## Optimize all lengths: LogLk = -587.981 Time 0.03
## Total time: 0.04 seconds Unique: 6/8 Bad splits: 0/3
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.04 seconds
## Refining topology: 28 rounds ME-NNIs, 2 rounds ME-SPRs, 14 rounds ML-NNIs
##       0.12 seconds: SPR round   1 of   2, 101 of 234 nodes
##       0.27 seconds: SPR round   2 of   2, 101 of 234 nodes
## Total branch-length 1.596 after 0.35 sec
##       0.38 seconds: ML Lengths 101 of 116 splits
##       0.51 seconds: ML NNI round 1 of 14, 101 of 116 splits, 29 changes (max delta 5.999)
## ML-NNI round 1: LogLk = -4057.963 NNIs 35 max delta 6.00 Time 0.53
##       0.62 seconds: Optimizing GTR model, step 4 of 12
##       0.74 seconds: Optimizing GTR model, step 8 of 12
##       0.85 seconds: Optimizing GTR model, step 11 of 12
## GTR Frequencies: 0.2038 0.2859 0.3002 0.2101
## GTR rates(ac ag at cg ct gt) 1.0761 0.9774 0.5841 0.4792 1.2071 1.0000
##       0.95 seconds: Site likelihoods with rate category 1 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.789 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
##       1.13 seconds: ML NNI round 2 of 14, 101 of 116 splits, 15 changes (max delta 0.290)
## ML-NNI round 2: LogLk = -3817.538 NNIs 18 max delta 1.05 Time 1.16
## ML-NNI round 3: LogLk = -3814.285 NNIs 1 max delta 3.25 Time 1.23
## ML-NNI round 4: LogLk = -3814.285 NNIs 0 max delta 0.00 Time 1.28
## Turning off heuristics for final round of ML NNIs (converged)
##       1.28 seconds: ML NNI round 5 of 14, 1 of 116 splits
##       1.42 seconds: ML NNI round 5 of 14, 101 of 116 splits, 0 changes
## ML-NNI round 5: LogLk = -3814.284 NNIs 0 max delta 0.00 Time 1.46 (final)
## Optimize all lengths: LogLk = -3814.284 Time 1.52
##       1.69 seconds: ML split tests for    100 of    115 internal splits
## Total time: 1.72 seconds Unique: 118/118 Bad splits: 0/115
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 13 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
## Total branch-length 0.104 after 0.01 sec
## ML-NNI round 1: LogLk = -699.140 NNIs 1 max delta 0.00 Time 0.01
## GTR Frequencies: 0.2026 0.3061 0.2592 0.2321
## GTR rates(ac ag at cg ct gt) 0.7128 4.0992 1.9319 0.9442 0.6308 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.648 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -675.314 NNIs 1 max delta 0.00 Time 0.04
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -675.314 NNIs 0 max delta 0.00 Time 0.05 (final)
## Optimize all lengths: LogLk = -675.314 Time 0.05
## Total time: 0.06 seconds Unique: 9/10 Bad splits: 0/6
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR74.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.01 seconds
## Refining topology: 25 rounds ME-NNIs, 2 rounds ME-SPRs, 12 rounds ML-NNIs
##       0.11 seconds: SPR round   2 of   2, 101 of 140 nodes
## Total branch-length 0.623 after 0.13 sec
## ML-NNI round 1: LogLk = -2035.235 NNIs 32 max delta 1.30 Time 0.19
##       0.22 seconds: Optimizing GTR model, step 3 of 12
##       0.32 seconds: Optimizing GTR model, step 11 of 12
## GTR Frequencies: 0.2003 0.2812 0.3012 0.2172
## GTR rates(ac ag at cg ct gt) 1.3353 1.7821 0.7868 0.4971 1.3891 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.772 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -1916.232 NNIs 8 max delta 0.00 Time 0.46
## Turning off heuristics for final round of ML NNIs (converged)
##       0.46 seconds: ML NNI round 3 of 12, 1 of 69 splits
## ML-NNI round 3: LogLk = -1916.232 NNIs 1 max delta 0.00 Time 0.54 (final)
## Optimize all lengths: LogLk = -1916.232 Time 0.56
## Total time: 0.66 seconds Unique: 71/71 Bad splits: 0/68
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR74_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 6 rounds ME-NNIs, 2 rounds ME-SPRs, 3 rounds ML-NNIs
## Total branch-length 0.053 after 0.00 sec
## ML-NNI round 1: LogLk = -560.888 NNIs 0 max delta 0.00 Time 0.00
## GTR Frequencies: 0.1960 0.3050 0.2620 0.2370
## GTR rates(ac ag at cg ct gt) 3.2025 7.6580 2.7938 2.4124 1.8012 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.635 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -548.791 NNIs 0 max delta 0.00 Time 0.00
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -548.791 NNIs 0 max delta 0.00 Time 0.00 (final)
## Optimize all lengths: LogLk = -548.791 Time 0.00
## Total time: 0.01 seconds Unique: 3/3 Bad splits: 0/0

9.6 Reading and plotting trees

Generated trees are edited using treeio and plotted with ggtree. The trees were rerooted to their respective Unmutated Common Ancestor (UCA), which in this case is defined simply as the V and J gene germlines concatenated. The gap between V and J is introduced automatically by the alignment method, so the CDR3 is not considered for the UCA.
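As a minimal toy illustration of the rerooting step (made-up tree, not from the data; assuming ape is available, which is loaded as a dependency of treeio), rooting on the tip whose label contains "UCA" simply makes that germline sequence the outgroup:

# toy example, not part of the analysis data
toy_tree <- ape::read.tree(text = "((seq1:0.10,seq2:0.20):0.05,(seq3:0.30,seq4_UCA:0.01):0.05);")
toy_rooted <- ape::root(toy_tree, which(grepl("UCA", toy_tree$tip.label)))
ape::is.rooted(toy_rooted)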

# function to root tree in UCA 
.to_root_uca <- function(x){
  root(x, which(grepl("UCA", x[["tip.label"]])))
}

ls <- list.files(paste0("../",result.dir), pattern = "*\\.tre", full.names = TRUE)

names(ls) <- lapply(ls, function(x) {
  if (stringr::str_count(x, "LOR") > 1) {
    substr(x, 24, 34)
  }else if (grepl("LC",x)) {
    substr(x, 24, 31)
  }else if (grepl("LOR",x)) {
    substr(x, 24, 28)
  }else{
    substr(x, 24,33)
    }})
trees <- lapply(ls, read.tree)
trees_rerooted <- lapply(trees, .to_root_uca)
plots <- lapply(trees_rerooted, ggtree)

for(i in seq_along(plots)){
  print(i)
  plot_name <- names(plots)[i]
  
  plots_edit <- plots[[i]]$data %>%
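    # relabel tree tips: map single-cell well IDs to LOR mAb names (lor_mabs)
    # and infer the timepoint from patterns in the sequence names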
    mutate(new_label = ifelse(grepl("sc_|LOR|E11_TR1-B2-igg_M05494_64_000000000-CRY9F_1_2115_14487_10780",label), gsub("sc_", "",label), ""),
           new_label = plyr::mapvalues(new_label,from = lor_mabs$well_ID, to = lor_mabs$LOR,
                                       warn_missing = FALSE),
           new_label = ifelse(grepl("LOR",new_label), new_label, ""),
           name = label,
           timepoint = case_when(grepl("B2-igg",name) ~ "B2",
                                 grepl("B1-igg",name) ~ "B1",
                                 grepl("sc_|LOR",name) ~ "single_cell",
                                 grepl("igm",name) ~ "PV",
                                 TRUE ~ "intersects"),
           name = gsub("sc_", "", label))
  
  
  if(stringr::str_count(plot_name, "LOR") > 1){
  plots_edit <- left_join(plots_edit, sc_seq_count[[paste0(plot_name,1)]], by = "name") 
  }else{ 
  plots_edit <- left_join(plots_edit, sc_seq_count[[plot_name]], by = "name") 
  }
 # shapes_timepoints <- c("PV" = 18, "B1" = 17, "B2" = 16, "single_cell" = 4)
  colors_timepoints <- c("PV" = "red", "intersects" = "black", "B1" = "#66c2a5", "B2" = "#fc8d62", "single_cell" = "#fc8d62")

  gg_plot <- ggtree(plots_edit,aes(color = timepoint)) + 
    {if(grepl("LC", plot_name))geom_tippoint(shape = 18, size = 1)
      else geom_tippoint(aes(size = duplicated), shape = 18)}+
   # geom_tippoint(aes(size = duplicated), shape = 23) +
    geom_tiplab(aes(label = new_label), hjust = -.2) +
    labs(size = "Count", shape = "Shape", color = "Timepoint") + 
    scale_color_manual(values = colors_timepoints) +
    coord_cartesian(clip = 'off') + 
   # scale_shape_manual(values = shapes_timepoints) +
    scale_size_area(limits = c(1,25), breaks = c(1,5,10,15,25), max_size = 3)+
    geom_treescale(width = .05)

  print(gg_plot)
  
   ggsave(gg_plot, filename = paste0("../", result.dir,  names(ls)[[i]], "_tree.pdf"), width = ifelse(grepl("LC", plot_name), 2, 4), height = ifelse(grepl("LC", plot_name), 2, 6))
  }
## [1] 1

## [1] 2

## [1] 3

## [1] 4

## [1] 5

## [1] 6

11 Profiling expressed monoclonal antibodies

11.1 Heatmap of competition between LOR mAbs

LOR mAbs showed diverse competition profiles against different standardized mAbs while binding to the PostF and PreF RSV fusion proteins.

data_comp_auc$EC50_preF <- log10(data_comp_auc$EC50_preF)
data_comp_auc$EC50_postF <- log10(data_comp_auc$EC50_postF)
data_comp_auc$IC50_RSV_A2 <- log10(data_comp_auc$IC50_RSV_A2)

# change here the columns you want to remove from the heatmap
to_remove <- c("EC50_postF","EC50_preF", "epitope", "specificity", "IC50_RSV_A2")
annot_colors <- list(specificity = c("PreF" = "#F7D586",
                                     "PreF/PostF" = "#CD87F8",
                                     "PostF" = "#92CDD6",
                                     "w.b." = "grey20"), epitope = c(
                                                    "Ø" = "#F3B084",
                                                    "V" = "salmon", 
                                                    "III" = "#A9D08D",
                                                    "IV" = "#FAD0FF",
                                                    "II" = "#FFD966",
                                                    "I" = "#9BC1E6",
                                                    "UK-Pre" = "#C9C9C9",
                                                    "UK-DP" = "#8497B0" ,
                                                    "WB" = "grey95",
                                                    "PostF" = "black",
                                                    "foldon" = "grey40"))

{
  g_heatmap_scale <- data_comp_auc %>%
    select(-all_of(to_remove)) %>%
    mutate(across(.cols = everything(), .fns = function(x) pmax(x,0))) %>%
    t() %>%
    pheatmap::pheatmap(scale = "none", angle_col = 90, cutree_cols = 8, 
                     clustering_method = "ward.D", 
                     color = viridisLite::viridis(100), 
                     cluster_rows = FALSE, border_color = "grey40", 
                     cluster_cols = TRUE, 
                     legend_breaks = c(0, 0.5, 1, 1.5, max(.)),
                     legend_labels = c("0", "0.5", "1", "1.5", "Normalized\ncompetition\n"),
                     annotation_col = data_comp_auc[,13:14], 
                     annotation_colors = annot_colors, 
                     cellwidth = 5, cellheight = 5, fontsize_row = 5, fontsize_col = 6, fontsize = 6)
 
  ggsave(g_heatmap_scale,filename = paste0("../", result.dir, "/", "LOR_heatmap_auc_wardD.pdf"), width = 18, height = 12, units = "cm")
}

11.2 Multidimensional scaling of competition between LOR mAbs

This MDS is another way to visualize the same data shown in the heatmap above. Here, the MDS input is the euclidean distance. The data were centered and scaled prior to calculating the euclidean distance, which is the simplest and most commonly used distance metric.
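A small self-contained illustration (made-up numbers, not the competition data) of why the values are centered and scaled first: without scaling, a variable with a large numeric range dominates the euclidean distance.

# hypothetical toy matrix: two variables on very different scales
m <- rbind(mAb_a = c(0.1, 500), mAb_b = c(0.2, 510), mAb_c = c(0.9, 505))
dist(m)        # distances dominated by the second (large-range) column
dist(scale(m)) # after centering and scaling, both columns contribute comparably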

11.2.1 Colored by epitope specificity

# check which mabs should be included and added
to_habillage <- factor(rownames(data_comp_auc), levels = c(paste0("LOR", sprintf(fmt = "%02d",seq(1:106)))))

data_comp_auc <- data_comp_auc %>% tibble::rownames_to_column("name") 
to_remove <- c("epitope", "specificity", "IC50_RSV_A2")

# compute MDS
mds_scaled <- data_comp_auc %>%
  select(-all_of(c(to_remove, "name"))) %>%
  scale(center = TRUE, scale = TRUE) %>%
  dist(method = "euclidean") %>%          
  cmdscale() %>%
  as_tibble()

colnames(mds_scaled) <- c("Dim.1", "Dim.2")
# Plot MDS
mds_scaled$name <- to_habillage
  
p1 <- mds_scaled %>%
  left_join(data_comp_auc) %>%
  ggplot(aes(Dim.1,Dim.2, label = name, color = epitope)) +
  geom_point(size = 2) +
  ggrepel::geom_text_repel(max.overlaps = 5) +
  scale_color_manual(values = annot_colors[["epitope"]]) +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5),
        aspect.ratio = 1) 

plotly::ggplotly(p1)
ggsave(plot = p1, width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-sites.pdf"))

11.2.2 Colored by RSV neutralization

p2 <- mds_scaled %>%
  left_join(data_comp_auc) %>%
  ggplot(aes(Dim.1,Dim.2, label = name, color = IC50_RSV_A2)) +
  geom_point(size = 2) +
  ggrepel::geom_text_repel(max.overlaps = 5) +
  scale_color_viridis_c() +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  labs(color = "IC50 RSV\n(log10)")+
  theme(axis.ticks = element_line(size = .5),
        aspect.ratio = 1) 

plotly::ggplotly(p2)
ggsave(plot = p2, width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-neuts.pdf"))

11.3 Processing and merging LOR metadata

mabs_lors <- mabs_lors %>%
  filter(name %in% lor_mabs$well_ID) %>%
  mutate(mAbs_ID = plyr::mapvalues(name, from = lor_mabs$well_ID, to = lor_mabs$LOR)) %>%
  arrange(mAbs_ID) %>%
  relocate(mAbs_ID)

mabs_clonal_rank <- rds_summary
for(i in mabs_lors$name) {
  mabs_clonal_rank[[i]] <- ifelse(grepl(i, rds_summary$sc_clone_grp), 
                                  mabs_lors[mabs_lors$name == i,]$mAbs_ID, 
                                  NA)
}

col_combine <- colnames(mabs_clonal_rank[15:length(mabs_clonal_rank)])         
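# collapse the per-mAb indicator columns into a single LOR column by taking
# the first non-NA value across them (coalesce)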
mabs_clonal_rank <- mabs_clonal_rank %>%  
  mutate(LOR = purrr::invoke(coalesce, across(all_of(col_combine)))) %>%
  select(LOR, colnames(mabs_clonal_rank)[! colnames(mabs_clonal_rank) %in% col_combine]) %>%
  filter(!is.na(LOR), database == "Individualized", Timepoint == "B2") %>%
  arrange(LOR) %>%
  mutate(well_ID = plyr::mapvalues(LOR, from = lor_mabs$LOR, to = substr(lor_mabs$well_ID, 1,3))) %>%
  filter(ID == well_ID)
  
mabs_lors <- mabs_lors %>% 
  mutate(clonal_rank_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size_rank),
         clonal_size_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size),
         clonal_group = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$grp)) %>%
  relocate(mAbs_ID, clonal_rank_B2, clonal_size_B2, clonal_group)

# Fixed manually LOR24 (same as LOR19), LOR37 (not detected B2, clonal size sc = 1), LOR40 (not detected B2, clonal size sc = 1 ), LOR42 (same clone as LOR01), LOR94 (same clone as LOR01), LOR96 (not detected B2, clonal size sc = 4)

#mabs_lors %>% write.csv(paste0("../data/single_cell/LOR_mAbs_info.csv"), row.names = F)
mabs_lors <- read.csv(file = paste0("../data/single_cell/LOR_mAbs_info.csv"), sep = ",")

mabs_lors %>%
  left_join(data_comp_auc %>% tibble::rowid_to_column("mAbs_ID") %>% mutate(mAbs_ID = as.character(name)), by = "mAbs_ID") %>%
ggplot(aes(x = clonal_rank_B2, y = clonal_size_B2, color = epitope)) +
  geom_point(size = 3) +
  scale_color_manual(values = annot_colors[["epitope"]]) +
  scale_y_log10() +
  scale_x_log10() +
  labs(fill = "Animal ID", x = "Clonal rank\n(log10)", y = "Clonal size\n(log10)") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5))

11.4 Heavy and light chain paired V genes

merged_mabs_lors <- mabs_lors %>%
  left_join(clonotype_rsv, by = "name") %>%
  left_join(clono_light_chains,suffix = c("_HC","_LC"), by = "name") %>%
  mutate(v_call_HC = V_gene.x) %>%
  fill(v_call_LC)

write.csv(merged_mabs_lors, file = "../data/single_cell/LOR_mAbs_info_full.csv", row.names = F)

merged_mabs_lors %>%
  group_by(v_call_HC, v_call_LC) %>%
  mutate(mabs_counts = n()) %>%
  ggplot(aes(x= v_call_HC, y = v_call_LC, fill = mabs_counts)) +
    geom_tile(color = "black") +
    scale_fill_viridis_c(option = "viridis", direction = 1) +
    scale_y_discrete(position = "right") +
    labs(fill = "mAb counts", x = "IGHV alleles", y = "IGHJ alleles") + 
    coord_equal() +
    ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
     theme(axis.text.x = element_text(angle = 90, size = 6, hjust = 1, 
                                     vjust = 0.5, face = "bold", colour = "black"),
          axis.text.y = element_text(face = "bold", colour = "black", size = 6),
          strip.text = element_text(face = "bold"),
          panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          legend.position =  "right",
          axis.ticks = element_line(size = .5),
          legend.title = element_text())

12 Take environment snapshot

renv::snapshot()

13 SessionInfo

sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS/LAPACK: /Users/rodrigoarcoverde/opt/anaconda3/envs/rsv_np_repertoire/lib/libopenblasp-r0.3.21.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4    dplyr_1.1.0         ggprism_1.0.4      
##  [4] scales_1.2.1        vegan_2.6-4         lattice_0.20-45    
##  [7] permute_0.9-7       data.table_1.14.8   Biostrings_2.66.0  
## [10] GenomeInfoDb_1.34.8 XVector_0.38.0      IRanges_2.32.0     
## [13] S4Vectors_0.36.0    BiocGenerics_0.44.0 ggpubr_0.6.0       
## [16] ggplot2_3.4.1       rstatix_0.7.2       ggtree_3.6.0       
## [19] treeio_1.22.0       tidyr_1.3.0         jsonlite_1.8.4     
## 
## loaded via a namespace (and not attached):
##  [1] ggVennDiagram_1.2.2    colorspace_2.1-0       ggsignif_0.6.4        
##  [4] ellipsis_0.3.2         class_7.3-21           aplot_0.1.10          
##  [7] rstudioapi_0.14        proxy_0.4-27           farver_2.1.1          
## [10] ggrepel_0.9.3          fansi_1.0.4            xml2_1.3.3            
## [13] splines_4.2.2          cachem_1.0.7           knitr_1.42            
## [16] broom_1.0.4            cluster_2.1.4          pheatmap_1.0.12       
## [19] compiler_4.2.2         httr_1.4.5             backports_1.4.1       
## [22] Matrix_1.5-3           fastmap_1.1.1          lazyeval_0.2.2        
## [25] cli_3.6.0              htmltools_0.5.4        tools_4.2.2           
## [28] gtable_0.3.1           glue_1.6.2             GenomeInfoDbData_1.2.9
## [31] Rcpp_1.0.10            carData_3.0-5          jquerylib_0.1.4       
## [34] vctrs_0.5.2            ape_5.7                svglite_2.1.1         
## [37] nlme_3.1-162           crosstalk_1.2.0        xfun_0.37             
## [40] stringr_1.5.0          rvest_1.0.3            lifecycle_1.0.3       
## [43] zlibbioc_1.44.0        MASS_7.3-58.3          parallel_4.2.2        
## [46] RColorBrewer_1.1-3     yaml_2.3.7             gridExtra_2.3         
## [49] ggfun_0.0.9            yulab.utils_0.0.6      sass_0.4.5            
## [52] stringi_1.7.12         highr_0.10             tidytree_0.4.2        
## [55] e1071_1.7-13           rlang_1.0.6            pkgconfig_2.0.3       
## [58] systemfonts_1.0.4      bitops_1.0-7           evaluate_0.20         
## [61] purrr_1.0.1            sf_1.0-7               patchwork_1.1.2       
## [64] htmlwidgets_1.6.1      labeling_0.4.2         cowplot_1.1.1         
## [67] tidyselect_1.2.0       plyr_1.8.8             magrittr_2.0.3        
## [70] R6_2.5.1               generics_0.1.3         DBI_1.1.3             
## [73] pillar_1.8.1           withr_2.5.0            mgcv_1.8-42           
## [76] units_0.8-1            abind_1.4-5            RCurl_1.98-1.10       
## [79] tibble_3.2.0           crayon_1.5.2           car_3.1-1             
## [82] KernSmooth_2.23-20     utf8_1.2.3             plotly_4.10.1         
## [85] RVenn_1.1.0            rmarkdown_2.20         grid_4.2.2            
## [88] digest_0.6.31          classInt_0.4-9         webshot_0.5.4         
## [91] gridGraphics_0.5-1     munsell_0.5.0          viridisLite_0.4.1     
## [94] ggplotify_0.1.0        bslib_0.4.2