1 Abstract

This Rmarkdown provides the analysis of the processed data in the paper entitled “Multivalent Antigen Display on Nanoparticles Diversifies B Cell Responses”. It includes the B cell receptor analysis, the plots, and the generation of intermediate files for plotting with other programs. The processed files used here are AIRR-compliant datasets that can be downloaded from Zenodo (record 7895251); they are either Human Respiratory Syncytial Virus-specific (RSV-specific) or the full bulk repertoire from the vaccinated non-human primates (Macaca mulatta).

2 Needed libraries

library(jsonlite)
library(tidyr)
library(treeio)
library(ggtree)
library(rstatix)
library(ggpubr)
library(Biostrings)
library(data.table)
library(vegan)
library(ggplot2)
library(scales)
library(ggprism)
library(dplyr)
library(kableExtra)
source("df_to_fasta.R")
set.seed(123)

3 Load data

The data used here is stored in .rds format or as a table (.csv or .tsv) and can be found in the Zenodo repository (record 7895251).

data_comp <- read.csv("../data/ELISA_comp/2023-03-06_normalized_plasma_compt.csv") %>%
  mutate(group = case_when(group == "NP 100%" ~ "20-mer",
                           group == "NP 50%" ~ "10-mer",
                           group == "Sol" ~ "1-mer",
                           group == "PostF" ~ "PostF"))

clonotype_rsv <- readRDS("../data/clonotypes/individualized/rsv-specific_clonotypes.rds")


# add AIRR-compliant column names and filter out the "PV" timepoint, which was not used for the analysis
clonotype_rsv <- clonotype_rsv %>%
  filter(timepoint != "PV") %>%
  mutate(cdr3_aa_length = nchar(CDR3_aa),
         cdr3_aa = CDR3_aa,
         v_call = V_gene,
         j_call = J_gene,
         d_call = D_gene) 

# set color for fill and color aes layers
fill_col_values <- c("20-mer" = "#5F90B0", "10-mer" = "#92CDD5", "1-mer" = "#D896C1", "PostF" = "grey50")
color_values <- c("20-mer" = "#2E6997", "10-mer" = "#469698", "1-mer" = "#BE3C8C", "PostF" = "grey20")
 
# load repertoire sequencing reads info
rep_seq_ls <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = TRUE)
rep_seq_names <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = FALSE)
names(rep_seq_ls) <- sub("_st","",gsub("/","_",substr(rep_seq_names, 1,10)))

# load mAbs data
lor_mabs <- read.csv("../data/specificity/LOR_single-cells_characterized.csv")
mabs_lors <- read.csv("../data/single_cell/sc_summary_filtered.csv")

# load competition ELISA data
# edited the raw values to have the max value as the reference competition for ADI, MPE8, 101F, D25
data_comp_auc <- read.csv("../data/ELISA_comp/2023-04-28_LOR_norm-dAUC.csv", header = T, row.names = 1, encoding="UTF-8") %>%
  replace(is.na(.), 0) %>%
  mutate(specificity = factor(specificity, levels = c("PreF", "PreF/PostF", "PostF", "w.b.")),
         epitope = factor(epitope, levels = c("Ø","V", "IV","III", "II","I", "foldon", "UK-Pre","UK-DP","PostF", "WB")))

# load light chains information clonotyped
clono_light_chains <- read.table("../data/clonotypes/light_chain_assigned_clonotypes.tsv", sep = "\t", header = TRUE) 

4 Plasma profiling

The non-human primate plasma samples from this study were used in antibody competition ELISAs against previously characterized monoclonal antibodies, on plates coated with pre-fusion or post-fusion protein. Here we show this data using different visualization methods.

4.1 Multidimensional scaling (MDS)

Multidimensional scaling aims to preserve the distances between datapoints and/or samples computed from a set of variables. Thus, the closer two points are, the more similar their competition profiles are in this case. It is a good way to summarize a large number of variables in a 2D space. The MDS input is a dissimilarity matrix; for this plot the matrix is based on the cosine distance. Although the Euclidean distance is the most commonly used, we decided to use the cosine distance here because the magnitude of the responses is less important to us than the competition profile itself. Some NHPs were very strong responders, so the Euclidean distance would place them far away from all the other NHPs simply because of their high antibody titers. Since the cosine distance measures the angle between profiles rather than their absolute values, it is insensitive to the magnitude of the antibody titers.

cosine_distance <- function(x) {
  # for cosine similarity matrix
  Matrix <- x %>% 
    select(-c(timepoints, ID, group)) %>%
    as.matrix()
  sim <- Matrix / sqrt(rowSums(Matrix * Matrix))
  sim <- sim %*% t(sim)
  # convert to cosine dissimilarity matrix (distance matrix)
  D_sim <- as.dist(1 - sim)
  
  mds <- D_sim %>%       
    cmdscale(3) %>%
    as_tibble()
  colnames(mds) <- c("Dim.1", "Dim.2", "Dim.3")
  
  mds <- mds %>%
    mutate(group = x$group,
           timepoints = x$timepoints,
           ID = x$ID)
  return(mds)
}
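As a quick check of the scale-invariance argument above, here is a minimal sketch with made-up competition values (not part of the dataset): animal “b” has the same profile as “a” but 10x the magnitude, while “c” has a different profile.

# toy values only, to illustrate why cosine distance was chosen over Euclidean distance
toy <- data.frame(mAb1 = c(10, 100, 10),
                  mAb2 = c(20, 200, 5),
                  row.names = c("a", "b", "c"))
m <- as.matrix(toy)
dist(m)                          # Euclidean distance separates "a" and "b" because of magnitude
sim <- m / sqrt(rowSums(m * m))  # row-normalize, as in cosine_distance()
as.dist(1 - sim %*% t(sim))      # cosine distance between "a" and "b" is ~0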

4.2 Plasma profiling with MDS divided by timepoint

The code below plots 4 different MDS plots. They are divided between 2 weeks after Boost 1 (B1) and Boost 2 (B2), with or without the animals vaccinated with the PostF immunogen. The label at the top of each plot indicates B1 or B2, and the legends state whether PostF animals are included. The cosine distance was calculated separately for each subset using the custom cosine_distance() function provided above; for that reason, the position of points should not be compared directly between plots but only between points within the same plot.

# calculate cosine distance and generate MDS with and without PostF
mds_b1 <- data_comp %>% filter(timepoints == "B1") %>% cosine_distance()
mds_b2 <- data_comp %>% filter(timepoints == "B2") %>% cosine_distance()
mds_no_postf_b1 <- data_comp %>% filter(group != "PostF", timepoints == "B1") %>% cosine_distance()
mds_no_postf_b2 <- data_comp %>% filter(group != "PostF", timepoints == "B2") %>% cosine_distance()
ls_mds <- list(mds_b1, mds_b2, mds_no_postf_b1, mds_no_postf_b2)

plot_list <- list()
for(f in seq_along(ls_mds)){
# Plot and color by groups
    mds_plot <- ls_mds[[f]]
    if(f %in% c(2,4)){
      mds_plot$Dim.1 <- mds_plot$Dim.1 * -1 # flip axis in first dimension
    }
    
    plot <- mds_plot %>%
      ggplot(aes(Dim.1, Dim.2, color = group, fill = group)) + 
      geom_point(size = 4, shape = 21) +
      scale_fill_manual(values = fill_col_values) +
      scale_color_manual(values = color_values) +
      lims(x = c(-.5,.5), y = c(-.4, .4)) + 
      ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
      theme(axis.ticks = element_line(size = .5),
            legend.position = c(.2,.2),
            legend.background = element_blank(),
            legend.box.background = element_rect(colour = "black")) +
      facet_wrap(~timepoints)
    
    plot_list[[f]] <- plot
    ggsave(plot = plot, filename = paste0("../", result.dir, f,"_mds_cosine-distance.pdf"), device = "pdf", width = 4, height = 4)
  }

ggarrange(plot_list[[3]], plot_list[[4]], plot_list[[1]], plot_list[[2]], ncol = 2, nrow = 2, common.legend = TRUE,legend = "top", legend.grob = get_legend(plot_list[[2]] + theme(legend.position = "top")))

4.3 Competition titers as bar plots

The code below generates bar plots to visualize the competition titers for epitopes on the PreF or PostF RSV antigen. The code starts by processing the data and categorizing the epitopes based on the competing mAbs. It then defines the pairwise comparisons for statistical testing between groups within each timepoint, with FDR correction. Afterwards, it performs Wilcoxon rank-sum tests and displays the results in a table. Finally, the code creates a series of bar plots for each epitope showing the competition titers at the different timepoints, with data shown as colored points and the mean indicated by a crossbar. The plots are faceted by epitope, providing an overview of the competition titer dynamics for the different antigenic sites.

The code focused solely on the statistical tests, analysis, and p-value correction can be found in rsv_statistical_analysis.Rmd or in its rendered version online.
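For reference, the FDR correction requested via p.adjust.method = "fdr" in rstatix::wilcox_test() corresponds to the Benjamini-Hochberg adjustment from base R's p.adjust(); a minimal sketch with made-up p-values:

# made-up p-values, only to illustrate the Benjamini-Hochberg ("fdr") adjustment
p_raw <- c(0.005, 0.020, 0.030, 0.400)
p.adjust(p_raw, method = "fdr")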

4.3.1 Competition for epitopes on PreF

data_comp_longer <- data_comp %>%
  pivot_longer(cols = 2:9, names_to = "mAb", values_to = "ELISA_competition") %>%
  mutate(epitopes = plyr::mapvalues(mAb, 
                                    from = c("D25.PreF", "MPE8.PreF", "ADI.PreF", "Pali.PreF", "X101F.PreF",
                                             "ADI.PostF", "X101F.PostF", "Pali.PostF"), 
                                    to = c("Ø", "III", "I", "II", "IV",
                                           "I", "IV", "II")),
         epitopes = factor(epitopes, levels = c("Ø", "III", "IV", "II", "I")),
         conformation = factor(case_when(grepl("PostF", mAb) ~ "PostF",
                                  grepl("PreF", mAb) ~ "PreF"), levels = c("PreF", "PostF")),
        timepoint_group = paste(timepoints, group, sep = "_"),
        timepoint_group_epitope = paste(timepoints, group, epitopes, sep = "_"))

my_comparisons_sites <- list(c("B1_1-mer", "B1_20-mer"),
                       c("B1_1-mer", "B1_10-mer"),
                       c("B1_1-mer", "B1_PostF"),
                       c("B1_10-mer", "B1_20-mer"),
                       c("B1_10-mer", "B1_PostF"),
                       c("B1_20-mer", "B1_PostF"),
                       c("B2_1-mer", "B2_20-mer"),
                       c("B2_1-mer", "B2_10-mer"),
                       c("B2_1-mer", "B2_PostF"),
                       c("B2_10-mer", "B2_20-mer"),
                       c("B2_10-mer", "B2_PostF"),
                       c("B2_20-mer", "B2_PostF"))

my_comparisons_1 <- lapply(my_comparisons_sites, paste0, "_I")
my_comparisons_2 <- lapply(my_comparisons_sites, paste0, "_II")
my_comparisons_3 <- lapply(my_comparisons_sites, paste0, "_III")
my_comparisons_4 <- lapply(my_comparisons_sites, paste0, "_IV")
my_comparisons_5 <- lapply(my_comparisons_sites, paste0, "_Ø")

# Selecting sites for statistical comparison only between groups for each timepoint
# removed statistical comparisons for Boost 1 site I in PreF since most values were below the threshold of detection
my_comparisons_preF <- c(my_comparisons_1[7:12], my_comparisons_2, my_comparisons_3, my_comparisons_4, my_comparisons_5)
my_comparisons_postF <- c(my_comparisons_1, my_comparisons_2, my_comparisons_4)
  

stat.test <- data_comp_longer %>%
  filter(conformation == "PreF") %>%
  wilcox_test(formula = ELISA_competition ~ timepoint_group_epitope, 
              paired = FALSE, 
              p.adjust.method = "fdr", 
              comparisons = my_comparisons_preF)

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
ELISA_competition B2_1-mer_I B2_20-mer_I 4 5 14.0 0.413 0.812 ns
ELISA_competition B2_1-mer_I B2_10-mer_I 4 5 20.0 0.020 0.176 ns
ELISA_competition B2_1-mer_I B2_PostF_I 4 4 0.0 0.029 0.190 ns
ELISA_competition B2_10-mer_I B2_20-mer_I 5 5 6.0 0.205 0.527 ns
ELISA_competition B2_10-mer_I B2_PostF_I 5 4 0.0 0.020 0.176 ns
ELISA_competition B2_20-mer_I B2_PostF_I 5 4 2.0 0.064 0.343 ns
ELISA_competition B1_1-mer_II B1_20-mer_II 4 5 3.0 0.111 0.428 ns
ELISA_competition B1_1-mer_II B1_10-mer_II 4 5 8.0 0.730 0.939 ns
ELISA_competition B1_1-mer_II B1_PostF_II 4 4 12.0 0.343 0.772 ns
ELISA_competition B1_10-mer_II B1_20-mer_II 5 5 6.0 0.222 0.545 ns
ELISA_competition B1_10-mer_II B1_PostF_II 5 4 16.0 0.190 0.513 ns
ELISA_competition B1_20-mer_II B1_PostF_II 5 4 19.0 0.032 0.190 ns
ELISA_competition B2_1-mer_II B2_20-mer_II 4 5 11.0 0.905 0.958 ns
ELISA_competition B2_1-mer_II B2_10-mer_II 4 5 11.0 0.905 0.958 ns
ELISA_competition B2_1-mer_II B2_PostF_II 4 4 6.0 0.686 0.939 ns
ELISA_competition B2_10-mer_II B2_20-mer_II 5 5 9.0 0.548 0.883 ns
ELISA_competition B2_10-mer_II B2_PostF_II 5 4 6.0 0.413 0.812 ns
ELISA_competition B2_20-mer_II B2_PostF_II 5 4 9.0 0.905 0.958 ns
ELISA_competition B1_1-mer_III B1_20-mer_III 4 5 7.0 0.556 0.883 ns
ELISA_competition B1_1-mer_III B1_10-mer_III 4 5 8.0 0.730 0.939 ns
ELISA_competition B1_1-mer_III B1_PostF_III 4 4 13.5 0.124 0.446 ns
ELISA_competition B1_10-mer_III B1_20-mer_III 5 5 13.0 1.000 1.000 ns
ELISA_competition B1_10-mer_III B1_PostF_III 5 4 20.0 0.018 0.176 ns
ELISA_competition B1_20-mer_III B1_PostF_III 5 4 20.0 0.018 0.176 ns
ELISA_competition B2_1-mer_III B2_20-mer_III 4 5 4.0 0.190 0.513 ns
ELISA_competition B2_1-mer_III B2_10-mer_III 4 5 11.0 0.905 0.958 ns
ELISA_competition B2_1-mer_III B2_PostF_III 4 4 11.0 0.486 0.883 ns
ELISA_competition B2_10-mer_III B2_20-mer_III 5 5 8.0 0.421 0.812 ns
ELISA_competition B2_10-mer_III B2_PostF_III 5 4 11.0 0.905 0.958 ns
ELISA_competition B2_20-mer_III B2_PostF_III 5 4 16.0 0.190 0.513 ns
ELISA_competition B1_1-mer_IV B1_20-mer_IV 4 5 4.0 0.190 0.513 ns
ELISA_competition B1_1-mer_IV B1_10-mer_IV 4 5 6.0 0.413 0.812 ns
ELISA_competition B1_1-mer_IV B1_PostF_IV 4 4 8.0 1.000 1.000 ns
ELISA_competition B1_10-mer_IV B1_20-mer_IV 5 5 11.0 0.841 0.958 ns
ELISA_competition B1_10-mer_IV B1_PostF_IV 5 4 13.0 0.556 0.883 ns
ELISA_competition B1_20-mer_IV B1_PostF_IV 5 4 13.0 0.556 0.883 ns
ELISA_competition B2_1-mer_IV B2_20-mer_IV 4 5 8.0 0.730 0.939 ns
ELISA_competition B2_1-mer_IV B2_10-mer_IV 4 5 11.0 0.905 0.958 ns
ELISA_competition B2_1-mer_IV B2_PostF_IV 4 4 7.0 0.886 0.958 ns
ELISA_competition B2_10-mer_IV B2_20-mer_IV 5 5 4.0 0.095 0.428 ns
ELISA_competition B2_10-mer_IV B2_PostF_IV 5 4 5.0 0.286 0.671 ns
ELISA_competition B2_20-mer_IV B2_PostF_IV 5 4 12.0 0.730 0.939 ns
ELISA_competition B1_1-mer_Ø B1_20-mer_Ø 4 5 11.0 0.898 0.958 ns
ELISA_competition B1_1-mer_Ø B1_10-mer_Ø 4 5 12.0 0.701 0.939 ns
ELISA_competition B1_1-mer_Ø B1_PostF_Ø 4 4 12.0 0.186 0.513 ns
ELISA_competition B1_10-mer_Ø B1_20-mer_Ø 5 5 10.0 0.666 0.939 ns
ELISA_competition B1_10-mer_Ø B1_PostF_Ø 5 4 16.0 0.109 0.428 ns
ELISA_competition B1_20-mer_Ø B1_PostF_Ø 5 4 16.0 0.109 0.428 ns
ELISA_competition B2_1-mer_Ø B2_20-mer_Ø 4 5 13.0 0.556 0.883 ns
ELISA_competition B2_1-mer_Ø B2_10-mer_Ø 4 5 12.0 0.730 0.939 ns
ELISA_competition B2_1-mer_Ø B2_PostF_Ø 4 4 16.0 0.029 0.190 ns
ELISA_competition B2_10-mer_Ø B2_20-mer_Ø 5 5 13.0 1.000 1.000 ns
ELISA_competition B2_10-mer_Ø B2_PostF_Ø 5 4 20.0 0.020 0.176 ns
ELISA_competition B2_20-mer_Ø B2_PostF_Ø 5 4 20.0 0.020 0.176 ns
data_comp_longer %>%
  filter(conformation == "PreF") %>%
  ggplot(aes(x = timepoint_group,
            y = ELISA_competition,
            color = group,
            fill = group)) +
  geom_point(size = 2, shape = 21) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = .5) +
  scale_y_log10() +
  stat_summary(fun.y = mean, 
               geom = "crossbar",
               color = "black") +
  geom_hline(yintercept = 10, linetype = "dashed") +
  labs(title = "PreF Antigenic Sites", y = "50% Competition Titer", x = "") +
  theme(axis.ticks = element_line(size = .5),
        legend.background = element_blank(),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 7)) +
  facet_wrap(~epitopes, nrow = 1) 

4.3.2 Competition for epitopes on PostF

stat.test <- data_comp_longer %>%
  filter(conformation == "PostF") %>%
  wilcox_test(formula = ELISA_competition ~ timepoint_group_epitope, 
              paired = FALSE, 
              p.adjust.method = "fdr", 
              comparisons = my_comparisons_postF) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
ELISA_competition B1_1-mer_I B1_20-mer_I 4 5 0.0 0.020 0.088 ns 6533.097 B1_1-mer…. 1 7
ELISA_competition B1_1-mer_I B1_10-mer_I 4 5 0.0 0.020 0.088 ns 7250.111 B1_1-mer…. 1 4
ELISA_competition B1_1-mer_I B1_PostF_I 4 4 1.0 0.059 0.127 ns 7967.125 B1_1-mer…. 1 10
ELISA_competition B1_10-mer_I B1_20-mer_I 5 5 3.0 0.056 0.127 ns 8684.139 B1_10-me…. 4 7
ELISA_competition B1_10-mer_I B1_PostF_I 5 4 9.0 0.905 0.905 ns 9401.153 B1_10-me…. 4 10
ELISA_competition B1_20-mer_I B1_PostF_I 5 4 12.0 0.730 0.773 ns 10118.168 B1_20-me…. 7 10
ELISA_competition B2_1-mer_I B2_20-mer_I 4 5 0.0 0.016 0.088 ns 10835.182 B2_1-mer…. 13 19
ELISA_competition B2_1-mer_I B2_10-mer_I 4 5 5.0 0.286 0.381 ns 11552.196 B2_1-mer…. 13 16
ELISA_competition B2_1-mer_I B2_PostF_I 4 4 0.0 0.029 0.088 ns 12269.210 B2_1-mer…. 13 22
ELISA_competition B2_10-mer_I B2_20-mer_I 5 5 7.0 0.310 0.385 ns 12986.224 B2_10-me…. 16 19
ELISA_competition B2_10-mer_I B2_PostF_I 5 4 0.0 0.016 0.088 ns 13703.238 B2_10-me…. 16 22
ELISA_competition B2_20-mer_I B2_PostF_I 5 4 3.0 0.111 0.195 ns 14420.252 B2_20-me…. 19 22
ELISA_competition B1_1-mer_II B1_20-mer_II 4 5 3.5 0.125 0.205 ns 15137.266 B1_1-mer…. 2 8
ELISA_competition B1_1-mer_II B1_10-mer_II 4 5 8.5 0.771 0.793 ns 15854.280 B1_1-mer…. 2 5
ELISA_competition B1_1-mer_II B1_PostF_II 4 4 0.0 0.026 0.088 ns 16571.294 B1_1-mer…. 2 11
ELISA_competition B1_10-mer_II B1_20-mer_II 5 5 6.5 0.236 0.327 ns 17288.309 B1_10-me…. 5 8
ELISA_competition B1_10-mer_II B1_PostF_II 5 4 1.0 0.034 0.088 ns 18005.323 B1_10-me…. 5 11
ELISA_competition B1_20-mer_II B1_PostF_II 5 4 2.0 0.064 0.127 ns 18722.337 B1_20-me…. 8 11
ELISA_competition B2_1-mer_II B2_20-mer_II 4 5 1.0 0.032 0.088 ns 19439.351 B2_1-mer…. 14 20
ELISA_competition B2_1-mer_II B2_10-mer_II 4 5 4.0 0.190 0.297 ns 20156.365 B2_1-mer…. 14 17
ELISA_competition B2_1-mer_II B2_PostF_II 4 4 0.0 0.029 0.088 ns 20873.379 B2_1-mer…. 14 23
ELISA_competition B2_10-mer_II B2_20-mer_II 5 5 8.0 0.421 0.474 ns 21590.393 B2_10-me…. 17 20
ELISA_competition B2_10-mer_II B2_PostF_II 5 4 0.0 0.016 0.088 ns 22307.407 B2_10-me…. 17 23
ELISA_competition B2_20-mer_II B2_PostF_II 5 4 2.0 0.064 0.127 ns 23024.421 B2_20-me…. 20 23
ELISA_competition B1_1-mer_IV B1_20-mer_IV 4 5 1.0 0.032 0.088 ns 23741.435 B1_1-mer…. 3 9
ELISA_competition B1_1-mer_IV B1_10-mer_IV 4 5 4.5 0.219 0.320 ns 24458.449 B1_1-mer…. 3 6
ELISA_competition B1_1-mer_IV B1_PostF_IV 4 4 2.0 0.114 0.195 ns 25175.464 B1_1-mer…. 3 12
ELISA_competition B1_10-mer_IV B1_20-mer_IV 5 5 6.0 0.222 0.320 ns 25892.478 B1_10-me…. 6 9
ELISA_competition B1_10-mer_IV B1_PostF_IV 5 4 6.0 0.413 0.474 ns 26609.492 B1_10-me…. 6 12
ELISA_competition B1_20-mer_IV B1_PostF_IV 5 4 13.0 0.556 0.607 ns 27326.506 B1_20-me…. 9 12
ELISA_competition B2_1-mer_IV B2_20-mer_IV 4 5 0.0 0.016 0.088 ns 28043.520 B2_1-mer…. 15 21
ELISA_competition B2_1-mer_IV B2_10-mer_IV 4 5 6.0 0.413 0.474 ns 28760.534 B2_1-mer…. 15 18
ELISA_competition B2_1-mer_IV B2_PostF_IV 4 4 0.0 0.029 0.088 ns 29477.548 B2_1-mer…. 15 24
ELISA_competition B2_10-mer_IV B2_20-mer_IV 5 5 7.0 0.310 0.385 ns 30194.562 B2_10-me…. 18 21
ELISA_competition B2_10-mer_IV B2_PostF_IV 5 4 0.0 0.016 0.088 ns 30911.576 B2_10-me…. 18 24
ELISA_competition B2_20-mer_IV B2_PostF_IV 5 4 3.0 0.111 0.195 ns 31628.590 B2_20-me…. 21 24
data_comp_longer %>%
  filter(conformation == "PostF") %>%
  ggplot(aes(x = timepoint_group,
            y = ELISA_competition,
            color = group,
            fill = group)) +
  geom_point(size = 2, shape = 21) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = .5) +
  scale_y_log10() +
  stat_summary(fun.y = mean, 
               geom = "crossbar",
               color = "black") +
  geom_hline(yintercept = 10, linetype = "dashed") +
  labs(title = "PostF Antigenic Sites", y = "50% Competition Titer", x = "") +
  theme(axis.ticks = element_line(size = .5),
        legend.background = element_blank(),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 7)) +
  facet_wrap(~epitopes, nrow = 1) 

5 Save metadata from B cell repertoire sequencing

Save a table containing read information from the high-throughput sequencing. It includes the number of raw reads per animal and the number of processed reads with high-quality assignment of HV and HJ genes using the IgDiscover software (an IgBlast wrapper).

rep_seq_stat <-  lapply(rep_seq_ls, function(x) fromJSON(txt = x ))
rep_seq_stat <-  lapply(rep_seq_stat, as.data.frame)
rep_seq_stat <- data.table::rbindlist(rep_seq_stat, idcol = "ID", fill = TRUE) 
rep_seq_stat <- rep_seq_stat %>%
  select(c(ID,read_preprocessing.raw_reads, assignment_filtering.total, assignment_filtering.has_vj_assignment, 17:21))

write.table(rep_seq_stat, "../data/processed_data/summary_sequencing_table.tsv", row.names = FALSE, sep = "\t", quote = FALSE )

6 Comparing repertoires

6.1 Processing clonotype datasets

After merging the antigen-specific sequences and bulk sequencing, we ran the clonotype module from IgDiscover to assign a clonotype group to each sequence. We did that for both the individualized germline database and the KIMDB database. Here we read those datasets and wrangle the data into the required format.

ls <- list.files("../data/clonotypes", recursive = T, full.names = T)
ls <- ls[grepl("rsv", ls) & grepl("individualized|nestor-rm", ls) ]
names(ls) <- basename(dirname(ls))
rds_merge <- lapply(ls, readRDS)
rds_merge <- rbindlist(rds_merge, idcol = TRUE, fill = TRUE)
rds_merge <- rds_merge %>%
  select(.id, specificity_group, sc_clone_grp, grp, new_name, ID_timepoint, V_SHM, V_errors, CDR3_aa, cdr3_aa, V_gene, J_gene, v_call, j_call) %>%
  mutate(
    Timepoint = factor(gsub(".*_", "", ID_timepoint), levels = c("PV", "B1", "B2", "Single-cell")),
    ID = gsub("_.*", "", ID_timepoint),
    database = plyr::mapvalues(.id,
      from = c("nestor-rm", "individualized"),
      to = c("KIMDB", "Individualized")
    ),
    database = factor(database, levels = c("KIMDB", "Individualized")),
    Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ),
    cdr3_aa_length = ifelse(is.na(CDR3_aa), nchar(cdr3_aa), nchar(CDR3_aa)),
    v_call = ifelse(is.na(v_call), V_gene, v_call),
    j_call = ifelse(is.na(j_call), J_gene, j_call)
  ) %>%
  group_by(.id, ID_timepoint) %>%
  distinct(new_name, .keep_all = TRUE) %>%
  ungroup() %>%
  filter(database %in% c("KIMDB", "Individualized"))

6.2 Plot number of sequences per animal

Plotting the sequencing depth for each timepoint per animal. This shows that the sequencing depth for Boost 1 was lower; for that reason, all analyses were normalized by sequencing depth to take this into account.

# samples from the biological replicates after boost 2 (BR2-B2) from E17 were merged with the technical replicates after boost 2 (TR1-B2)
# this was done due to low sequencing depth for that animal and large expansion of uncharacterized clones 
full_rep_indiv <- read.table("../data/processed_data/summary_sequencing_table.tsv", 
                             header = TRUE) %>%
  separate(ID, sep = "_", into = c("sample", "ID")) %>%
  filter(sample != "TR2-B2") %>%
  mutate(sample = ifelse(sample == "BR2-B2", "TR1-B2", sample),
         timepoint = case_when(sample == "igm" ~ "PV",
                            grepl("B1", sample) ~ "B1",
                            grepl("B2", sample) ~ "B2"),
         timepoint = factor(timepoint, levels = c("PV", "B1", "B2"))) %>%
  group_by(timepoint, ID) %>%
  summarise(across(where(is.numeric), sum)) %>%
  ungroup()
  
# since here we wanted to compare all groups, we used Dunn's test
stat.test <- full_rep_indiv %>%
              dunn_test(formula = assignment_filtering.has_cdr3 ~ timepoint,
                          p.adjust.method = "fdr") %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
assignment_filtering.has_cdr3 PV B1 9 9 -3.6525703 0.0002596 0.0007789 *** 1393265 PV, B1 1 2
assignment_filtering.has_cdr3 PV B2 9 9 -0.2672612 0.7892680 0.7892680 ns 1491771 PV, B2 1 3
assignment_filtering.has_cdr3 B1 B2 9 9 3.3853091 0.0007110 0.0010665 ** 1590276 B1, B2 2 3
full_rep_indiv %>%
  ggdotplot(y = "assignment_filtering.has_cdr3", x = "timepoint", 
            group = "timepoint", 
            fill = "ID", 
            size = 1) +
  geom_boxplot(alpha = .2, outlier.shape = NA) +
  ggpubr::stat_compare_means(method = "kruskal") +
  labs(x = "Timepoint", y = "# good quality aligned sequences") +
  scale_fill_viridis_d() +
  stat_pvalue_manual(stat.test %>% mutate(p.adj = round(p.adj, 4)), label = "p.adj") +
   ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = 1, axis_text_angle = 45, base_rect_size = 1.5)

write.csv(full_rep_indiv, paste0("../", result.dir, "repertoire_depth.csv"), row.names = F)
ggsave(paste0("../", result.dir, "repertoire_depth.pdf"), width = 5, height = 3)

6.3 Generate data for clone size per animal per database

rds_summary <- rds_merge %>%
  group_by(database, ID_timepoint, grp) %>%
  summarise(
    ID = first(ID), Timepoint = first(Timepoint), Group = first(Group),
    clonal_size = n(),
    database,
    sc_clone_grp = first(sc_clone_grp),
    V_errors = mean(V_errors),
    specificity_group = first(specificity_group),
    cdr3_aa_length = mean(cdr3_aa_length),
    v_call = first(v_call),
    j_call = first(j_call)
  ) %>%
  ungroup() %>%
  group_by(database, ID_timepoint) %>%
  mutate(clonal_size_rank = dense_rank(dplyr::desc(clonal_size))) %>%
  ungroup() %>%
  distinct()

rds_summary_noPV <- rds_summary %>%
  filter(Timepoint != "PV", Timepoint != "Single-cell")

rds_summary_save <- rds_summary %>%
  filter(Timepoint %in% c("Single-cell", "B1","B2"),
         database == "Individualized") %>%
  group_by(ID_timepoint) %>%
  summarise(mean_clonal_size = mean(clonal_size),
            median_clonal_size = median(clonal_size),
            geom_clonal_size = exp(mean(log(clonal_size))),
            unique_clones = sum(clonal_size == 1),
            total_clones_detected = n(),
            percentage_unique_clones = round(unique_clones/total_clones_detected*100,digits = 2)) 

rds_summary_save %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
ID_timepoint mean_clonal_size median_clonal_size geom_clonal_size unique_clones total_clones_detected percentage_unique_clones
E11_B1 73.765766 24.0 23.618580 7 111 6.31
E11_B2 65.842324 18.0 16.139896 22 241 9.13
E11_Single-cell 1.596491 1.0 1.310902 169 228 74.12
E12_B1 96.802632 8.5 11.635387 5 76 6.58
E12_B2 62.696296 9.0 9.733452 18 135 13.33
E12_Single-cell 1.556000 1.0 1.304827 185 250 74.00
E14_B1 79.458333 9.0 9.683393 11 96 11.46
E14_B2 111.530612 10.5 12.335661 29 196 14.80
E14_Single-cell 2.041096 1.0 1.527244 139 219 63.47
E16_B1 73.535519 14.0 15.223779 14 183 7.65
E16_B2 118.491228 31.5 29.123260 17 228 7.46
E16_Single-cell 1.996094 1.0 1.489967 169 256 66.02
E17_B1 38.460526 8.0 9.753009 8 76 10.53
E17_B2 55.614035 6.0 7.902730 31 171 18.13
E17_Single-cell 1.665158 1.0 1.381309 151 221 68.33
E18_B1 41.323529 11.0 11.772174 9 102 8.82
E18_B2 72.662921 21.0 15.643111 23 178 12.92
E18_Single-cell 1.641026 1.0 1.368241 135 195 69.23
E21_B1 35.201835 5.0 6.887328 21 109 19.27
E21_B2 45.202312 14.0 13.455144 16 173 9.25
E21_Single-cell 1.674208 1.0 1.385378 151 221 68.33
E23_B1 59.675497 9.0 11.417886 19 151 12.58
E23_B2 63.044025 13.0 14.180265 18 159 11.32
E23_Single-cell 1.916279 1.0 1.484490 136 215 63.26
E24_B1 55.469231 10.0 11.115858 18 130 13.85
E24_B2 62.310735 9.0 10.429075 29 177 16.38
E24_Single-cell 1.829897 1.0 1.400170 137 194 70.62
write.csv(rds_summary_save, 
          paste0("../", result.dir, "clone_size_mean_median.csv"),row.names = FALSE)

6.4 Process and merge dataset with metadata

rds_indiv <- rds_summary %>%
  filter(
    database == "Individualized",
    Timepoint != "Single-cell",
    Timepoint != "PV"
  ) %>%
  mutate(
    LOR = ifelse(grepl(pattern = paste0(lor_mabs$well_ID, collapse = "|"), x = sc_clone_grp), "cloned", "not_cloned"),
    LOR = factor(LOR, levels = c("cloned", "not_cloned"), ordered = TRUE)
  )
rds_indiv$Timepoint <- droplevels(rds_indiv$Timepoint)
all <- rds_indiv %>%
  tidyr::expand(ID, Timepoint, grp) %>%
  filter(Timepoint != "Single-cell")

rds_indiv <- rds_indiv %>%
  right_join(all) %>%
  mutate(
    ID_timepoint = paste(ID, Timepoint, sep = "_"),
    database = "individualized",
    clonal_size = ifelse(is.na(clonal_size), 0, clonal_size)
  ) %>%
  arrange(grp, ID_timepoint) %>%
  tidyr::fill(LOR, Group, sc_clone_grp)

6.5 Cumulative area clone size plots

The code below generates cumulative area plots for the clone sizes of the 20-mer and 1-mer groups, further divided by specificity group (DP & PreF). The code starts by summarizing the clone sizes for each combination of “Group”, “Timepoint”, and “specificity_group” from the “rds_indiv” data. The specificity group “PostF” is filtered out for this plot. The data is then plotted with ggplot2 as area plots, with the x-axis representing “Timepoint”, the y-axis representing the cumulative “Clonal size”, and the area filled by “specificity_group” (DP & PreF). Each “Group” is shown as a facet, displaying the cumulative growth of clone sizes over the timepoints. The area plots offer a visual representation of how clone sizes evolve over time, highlighting differences between the two specificity groups (DP & PreF) for both the 20-mer and 1-mer groups.

6.5.1 Clone size area plot (cumulative) of 20-mer and 1-mer, divided by specificity (DP & PreF)

rds_indiv %>%
  group_by(Group, Timepoint, specificity_group) %>%
  summarise(clonal_size = sum(clonal_size)) %>%
  tidyr::drop_na() %>%
  filter(specificity_group != "PostF") %>%
  ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
  geom_area(color = "black") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 80000)) +
  scale_x_discrete(expand = c(0.02, 0.02)) +
  scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
  labs(y = "Clonal size") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~Group)

ggsave(paste0("../", result.dir, "rsv_specific_clonal_size_area_plot_spec.pdf"), width = 8, height = 2)

6.5.2 Clone size area plot (cumulative) per animal, divided by specificity (DP & PreF)

rds_indiv %>%
  group_by(Group, ID, Timepoint, specificity_group) %>%
  summarise(clonal_size = sum(clonal_size)) %>%
  tidyr::drop_na() %>%
  filter(specificity_group != "PostF") %>%
  ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
  geom_area(color = "black") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 30000)) +
  scale_x_discrete(expand = c(0.02, 0.02)) +
  scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
  labs(y = "Clonal size") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~Group + ID, ncol = 5)

6.6 Correlation clone size per animal per database

The provided code plots the clonal sizes from two databases, “KIMDB” and “Individualized”, for different timepoints. It creates scatter plots with clonal sizes from the “Individualized” database on the x-axis and “KIMDB” database on the y-axis, allowing visual comparison and assessment of correlation. The dashed line with a slope of 1 and an intercept of 0 represents the ideal correlation line. The plots are grouped by timepoints, providing a concise visual understanding of the clonal size relationship between the two databases over time.

df_all <- data.frame()
for (timepoints in unique(rds_summary_noPV$Timepoint)) {
  for (databases in unique(rds_summary_noPV$database)) {
    rds_timepoint <- rds_summary_noPV %>%
      filter(Timepoint == timepoints, database == databases) %>%
      arrange(desc(clonal_size)) %>%
      pull(clonal_size)

    rds_indiv_corr <- rds_summary_noPV %>%
      filter(
        database == "Individualized",
        Timepoint == timepoints
      ) %>%
      arrange(desc(clonal_size)) %>%
      pull(clonal_size)

    length(rds_indiv_corr) <- length(rds_timepoint)

    df <- data.frame(
      size_indiv = rds_indiv_corr,
      clonal_size = rds_timepoint,
      Timepoint = timepoints,
      database = databases
    )
    df[is.na(df)] <- 0
    df_all <- rbind(df, df_all)
  }
}


df_all %>%
  filter(database != "Individualized") %>%
  mutate(database = factor(database, levels = c("KIMDB", "Individualized"))) %>%
  ggplot(aes(x = size_indiv, y = clonal_size)) +
  geom_point(size = 1) +
  geom_abline(intercept = 0, slope = 1, color = "black", linetype = "dashed") +
  scale_x_continuous(
    trans = pseudo_log_trans(base = 10),
    breaks = c(1, 10, 100, 1000, 10000),
    labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
  ) +
  scale_y_continuous(
    trans = pseudo_log_trans(base = 10),
    breaks = c(1, 10, 100, 1000, 10000),
    labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
  ) +
  labs(y = "Clonal size\n (KIMDB)", x = "Clonal size\n(Individualized database)") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~ Timepoint)

ggsave(paste0("../", result.dir, "individualized_db_comparison.pdf"), width = 12, height = 6)
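A note on the axis transformation used above: the clone sizes padded to 0 in the loop cannot be drawn on a plain log10 scale, which is why the plot uses scales::pseudo_log_trans(), a transformation that is finite at zero. A quick comparison with illustrative values:

# pseudo-log handles zeros, unlike log10
scales::pseudo_log_trans(base = 10)$transform(c(0, 1, 10, 100, 1000))
log10(c(0, 1, 10, 100, 1000))  # returns -Inf for 0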

6.7 IGHV-J pairing per database

The presented R code analyzes IGHV-J pairing for the two databases, “Individualized” and “KIMDB”. The code preprocesses data from the “rds_summary” dataframe, extracting V and J gene calls and calculating clonal size percentages for each V-J combination. It then creates a heatmap plot (plot_vj) with IGHV alleles on the x-axis and IGHJ alleles on the y-axis, using the fill color to represent the mean clonal size percentage for each V-J pairing. The heatmap is faceted by specificity group (“Total”, “PreF”, and “DP”) and vaccine group (“20-mer” and “1-mer”) for each database, providing a compact visual comparison of the IGHV-J pairing distributions between the two databases.

for(i in c("Individualized", "KIMDB")){
rds_summary_noPV <- rds_summary %>%
  # check if we want to filter clonotypes by size or not
  filter(database == i) %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  mutate(v_call = sub("\\*.*","", v_call),
         j_call = sub("\\*.*|-.*","", j_call),
         clonal_size_log = log10(clonal_size),
         Group = factor(Group, levels = c("20-mer", "1-mer")),
         specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP", "PostF"))) %>%
  group_by(ID) %>%
  mutate(clonal_size_perc = (clonal_size/sum(clonal_size)) * 100) %>%
  ungroup() %>% 
  group_by(Group, specificity_group, v_call, j_call) %>%
  summarise(clonal_size_mean_perc = mean(clonal_size_perc), clonal_size_mean = mean(clonal_size)) %>%
  filter(specificity_group != "PostF") %>% droplevels()

plot_vj <-  rds_summary_noPV %>%
    ggplot(aes(x= v_call, y = j_call, fill = clonal_size_mean_perc)) +
      geom_tile(color = "black") +
      scale_fill_viridis_c(option = "viridis", direction = 1, limits = c(0, 1.5), breaks = c(0,0.5, 1, 1.5)) +
      scale_y_discrete(position = "right") +
      ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
      coord_equal() +
      theme(axis.text.x = element_text(angle = 90, size = 5, hjust = 1, vjust = 0.5, face = "bold", colour = "black"),
            axis.text.y = element_text(face = "bold", colour = "black", size = 5),
            strip.text = element_text(face = "bold", size = 7),
            panel.grid.major = element_blank(), 
            panel.grid.minor = element_blank(),
            legend.position =  "top",
            axis.ticks = element_line(size = .5),
            legend.title = element_text()) +
      labs(fill = "Sequence count\n(% RSV repertoire\n per group)", x = "IGHV alleles", y = "IGHJ alleles", title = paste0(i, " Database")) +
      facet_wrap(~ specificity_group + Group, ncol = 1, strip.position = "left")

print(plot_vj)

ggsave(plot = plot_vj, filename = paste0("../",result.dir,i,"_IGHV-IGHJ_pairing-rsv-specific-sequences_perc.pdf"), width = 9, height = 7)
}

6.8 Count unique V-J pairs

The provided R code counts and compares unique V-J pairing occurrences in the “Individualized” database. It preprocesses the data, calculates the number of unique V-J pairs for each group, and performs statistical tests. The results are visualized using a dot plot with mean values and p-value labels, enabling easy comparison of unique V-J pairing counts among different groups.

# groups to perform statistical comparisons
my_comparisons <- list(c("Total_B1_20-mer", "Total_B1_1-mer"),
                       c("Total_B2_20-mer", "Total_B2_1-mer"),
                       c("PreF_B1_20-mer", "PreF_B1_1-mer"),
                       c("PreF_B2_20-mer", "PreF_B2_1-mer"),
                       c("DP_B1_20-mer", "DP_B1_1-mer"),
                       c("DP_B2_20-mer", "DP_B2_1-mer"))

rds_summary_noPV <- rds_summary %>%
  # optionally filter clonotypes by clonal size here (e.g., remove singletons)
  filter(database == "Individualized", Timepoint != "Single-cell", Timepoint != "PV") %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  filter(specificity_group != "PostF") %>%
  mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
         Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  group_by(ID_timepoint, specificity_group, Group_specificity, Group) %>%
  mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
  distinct(v_j_calls) %>%
  summarise(unique_v_j = n()) %>%
  ungroup()

rds_summary_noPV %>% write.csv(paste0("../", result.dir,"unique_HV_HJ_pairing-data.csv"), row.names = F)

stat.test <- rds_summary_noPV %>%
              wilcox_test(formula = unique_v_j ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
unique_v_j Total_B1_20-mer Total_B1_1-mer 5 4 14 0.413 0.496 ns 137.040 Total_B1…. 1 2
unique_v_j Total_B2_20-mer Total_B2_1-mer 5 4 11 0.901 0.901 ns 150.288 Total_B2…. 3 4
unique_v_j PreF_B1_20-mer PreF_B1_1-mer 5 4 4 0.176 0.264 ns 163.536 PreF_B1_…. 5 6
unique_v_j PreF_B2_20-mer PreF_B2_1-mer 5 4 0 0.016 0.048 * 176.784 PreF_B2_…. 7 8
unique_v_j DP_B1_20-mer DP_B1_1-mer 5 4 20 0.016 0.048 * 190.032 DP_B1_20…. 9 10
unique_v_j DP_B2_20-mer DP_B2_1-mer 5 4 18 0.064 0.127 ns 203.280 DP_B2_20…. 11 12
rds_summary_noPV %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "unique_v_j", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 1) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 200)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "HV and HJ unique pairs\n(Count)", x = "") +
  stat_summary(fun.y = mean, 
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 150) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "v-j-pair_count.pdf"), width = 5, height = 3)

6.9 VennDiagram of HV-HJ sharing

The provided R code generates a Venn diagram to visualize the sharing of HV-HJ combinations between the “20-mer” and “1-mer” groups in the “Individualized” database. The code preprocesses the data from the “rds_summary” dataframe, filtering out the “PostF” specificity and organizing the data by “Group” (vaccine group). It then creates two lists, one for the “20-mer” and another for the “1-mer” group, containing the distinct HV-HJ combinations for each. The Venn diagram is plotted using ggVennDiagram, representing the shared and unique HV-HJ combinations between the two groups. The diagram provides a concise visual understanding of the overlap and differences in HV-HJ pairing between clonotypes from the “20-mer” and “1-mer” animals in the “Individualized” database.

rds_summary_noPV <- rds_summary %>%
  # decide if clonal size greater than 1 should be included
  filter(database == "Individualized", Timepoint != "Single-cell") %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  filter(specificity_group != "PostF") %>%
  mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
         Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  group_by(Group) %>%
  mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
  distinct(v_j_calls) 

list_v_j <- list("20-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "20-mer"],
                 "1-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "1-mer"])

ggVennDiagram::ggVennDiagram(list_v_j, label_size = 7, 
                             label_alpha = 0,
                             set_color = "black",
                             set_size = 9) +
  scale_fill_gradient(low = "#F4FAFE", high = "#4981BF") +
  scale_color_manual(values = c("#5F90B0", "#D896C1")) +
  theme(legend.position = "none")

ggsave(paste0("../", result.dir,"shared-unique-v-j_pairing.pdf"), height = 5, width = 5)

7 Generate processed files for diversity index estimation

The lines that save intermediate files are commented out, so not all of them will be written; one could uncomment them if all the intermediate files are needed. The intermediate files were used to run Recon (v2.5) (Kaplinsky & Arnaout, Nat Commun, 2016) with default parameters; more about running Recon is described in the section Recon estimates for RSV-specific diversity below.

# add column for loop of ID, timepoint and specificity
clonotype_rsv <- clonotype_rsv %>%
  mutate(ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group))

# create empty lists for loop
filtered_animal_rsv <- list()
clonotype_summary_rsv <- list()

# loop to create summary and full clonotype files for saving and/or following analysis
for (animals in unique(clonotype_rsv$ID_timepoint_spec)) {
  # save full clonotype files
  filtered_animal_rsv[[animals]] <- clonotype_rsv %>%
    filter(ID_timepoint_spec == animals) %>%
    as.data.frame()
 # write.csv(filtered_animal_rsv[[animals]], paste0("../", result.dir, animals, "_clonotype_full.csv"), row.names = FALSE)

  # save summary files
  clonotype_summary_rsv[[animals]] <- filtered_animal_rsv[[animals]] %>%
    group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
    summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
    arrange(desc(clonal_size)) %>%
    ungroup()
  #write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
  }

# doing the same thing but for the total repertoire, i.e. not accounting for the specificities (PreF, DP, PostF)
for (animals in unique(clonotype_rsv$ID_timepoint)) {
  # save summary files
  clonotype_summary_rsv[[paste0(animals, "_total")]] <- clonotype_rsv %>%
    filter(ID_timepoint == animals) %>%
    as.data.frame() %>%
    mutate(
      specificity_group = "Total",
      ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group)
    ) %>%
    group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
    summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
    arrange(desc(clonal_size)) %>%
    ungroup()
 # write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
  }


# Save clonotype summary per animal, taking into account ID, timepoint and clonal group
filtered_animal_rsv_summary <- list()

for (animals in unique(clonotype_rsv$ID)) {
  # save summary files
  filtered_animal_rsv_summary[[animals]] <- clonotype_rsv %>%
    filter(ID == animals) %>%
    as.data.frame() %>%
    group_by(grp, timepoint, ID) %>%
    summarise(clonal_size = n(), single_cells = first(sc_clone_grp), v_call = first(v_call), j_call = first(j_call), cdr3_aa = first(cdr3_aa)) %>%
    arrange(desc(clonal_size)) %>%
    ungroup() 
   # write.csv(filtered_animal_rsv_summary[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
    }

to_recon_rsv <- data.table::rbindlist(clonotype_summary_rsv) %>%
  select(clonal_size, ID_timepoint_spec) %>%
  group_by(clonal_size, ID_timepoint_spec) %>%
  summarise(size = n()) %>%
  ungroup()


for (i in unique(to_recon_rsv$ID_timepoint_spec)) {
  to_recon_table <- to_recon_rsv %>%
    filter(ID_timepoint_spec == i) %>%
    select(clonal_size, size)

 # fwrite(to_recon_table, file = paste0("../", result.dir, i, "_rsv_file_to_recon.txt"), append = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
}
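For reference, each file written for Recon above is a two-column, tab-separated table without a header: the clone size and the number of clones observed at that size. A minimal sketch with illustrative values (not taken from the data):

# illustrative example of the Recon input produced by the (commented-out) fwrite() call above
# column 1 = clonal_size, column 2 = number of clones of that size; tab-separated, no header
example_to_recon <- data.frame(clonal_size = c(1, 2, 3, 10),
                               size = c(45, 12, 6, 1))
example_to_recon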

7.1 Calculating species richness

7.1.1 Chao1 estimates for RSV-specific diversity

Species richness (Chao1) was calculated using the vegan R package. The samples were first subsampled 100 times for each animal/timepoint and the species richness was then estimated. The mean of those 100 replicates was used for plotting.

Here is the processing and estimation of species richness:

rsv_chao <- lapply(clonotype_summary_rsv, function(x) select(x, "clonal_size"))


# subsample and replicate the subsampling 100 times for higher accuracy

chaox100 <- function(x, value_to_subsample) {
  replicate(100, {
    subsample <- vegan::rrarefy(x, value_to_subsample)
    chao <- vegan::estimateR(subsample)
    return(chao)
  })
}

diff_spec_timepoints <- unique(substring(names(clonotype_summary_rsv), 5))

# subsample based on lowest total clonotype size per group
chao_list_results <- list()
for (spec_timepoint in diff_spec_timepoints) {
  print(spec_timepoint)
  rsv_filtered <- rsv_chao[grepl(spec_timepoint, names(rsv_chao))]
  min_to_subsample <- min(unlist(lapply(rsv_filtered, colSums)))
  chao_list_results[[spec_timepoint]] <- lapply(rsv_filtered, chaox100, min_to_subsample)
}
## [1] "Single-cell_DP"
## [1] "B1_DP"
## [1] "B2_DP"
## [1] "Single-cell_PreF"
## [1] "B1_PreF"
## [1] "B2_PreF"
## [1] "Single-cell_PostF"
## [1] "B1_PostF"
## [1] "B2_PostF"
## [1] "Single-cell_total"
## [1] "B1_total"
## [1] "B2_total"
change_names <- function(x) {
  names(x) <- gsub("_.*", "", names(x))
  x
}

# adjusted dataset for plotting
{
  chao_results_df <- purrr::map(chao_list_results, ~ change_names(.x))
  chao_results_df <- rbindlist(chao_results_df, use.names = TRUE, idcol = TRUE, fill = TRUE)
  chao_results_df$algorithm <- rep(c("Obs", "Chao1", "Chao1_se", "ACE", "ACE_se"), nrow(chao_results_df) / 5)
  # save intermediate file
  chao_results_df %>%
    filter(algorithm %in% c("Chao1", "ACE")) %>%
    dplyr::rename(Timepoint_specificity = .id) %>%
    write.csv(paste0("../", result.dir, "rsv_repertoire_diversity.csv"), row.names = FALSE)
  # diversity mean of x100 replicates per animal
  chao_results_df %>%
    filter(algorithm %in% c("Chao1", "ACE")) %>%
    dplyr::rename(Timepoint_specificity = .id) %>%
    group_by(algorithm, Timepoint_specificity) %>%
    summarise_all(.funs = mean) %>%
    write.csv(paste0("../", result.dir, "rsv_repertoire_diversity_mean.csv"), row.names = FALSE)

  chao_results_df <- tidyr::pivot_longer(chao_results_df, cols = 2:(length(chao_results_df) - 1), names_to = c("ID")) %>%
    tidyr::separate(.id, c("Timepoint", "Specificity"), sep = "_") %>%
    mutate(Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ))
}
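For context on the algorithm labels assigned above: vegan::estimateR() returns five statistics per sample in a fixed order (observed richness, Chao1 and its standard error, ACE and its standard error), which is what the rep() call relies on. A minimal check on made-up counts:

# made-up clone-size counts, only to show the order of the estimateR() output
vegan::estimateR(c(25, 10, 6, 3, 1, 1, 1, 1))
# returns S.obs, S.chao1, se.chao1, S.ACE, se.ACE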

7.1.2 Chao1 plotting between groups for RSV-specific diversity

Here is the plot and comparison between the vaccinated groups:

chao_results_df$Specificity[chao_results_df$Specificity == "total"] <- "Total"

chao_results_df <- chao_results_df %>%
  filter(algorithm != "Chao1_se" & algorithm != "ACE_se") %>%
  filter(algorithm == "Chao1", Timepoint != "PV", Timepoint != "Single-cell") %>%
  mutate(Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ),
    Group_specificity = paste(Specificity, Timepoint, Group, sep = "_"),
    Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  group_by(ID, Group, Timepoint, Specificity, algorithm, Group_specificity) %>%
  summarise(value = mean(value)) %>%
  tidyr::drop_na() %>%
  ungroup()

write.csv(chao_results_df, paste0("../", result.dir, "chao1_results_plot_values.csv"), row.names = FALSE)


stat.test <- chao_results_df %>%
              wilcox_test(formula = value ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
value Total_B1_20-mer Total_B1_1-mer 5 4 16 0.190 0.285 ns 272.3352 Total_B1…. 1 2
value Total_B2_20-mer Total_B2_1-mer 5 4 15 0.286 0.343 ns 301.8278 Total_B2…. 3 4
value PreF_B1_20-mer PreF_B1_1-mer 5 4 7 0.556 0.556 ns 331.3205 PreF_B1_…. 5 6
value PreF_B2_20-mer PreF_B2_1-mer 5 4 2 0.064 0.127 ns 360.8131 PreF_B2_…. 7 8
value DP_B1_20-mer DP_B1_1-mer 5 4 20 0.016 0.048 * 390.3058 DP_B1_20…. 9 10
value DP_B2_20-mer DP_B2_1-mer 5 4 20 0.016 0.048 * 419.7984 DP_B2_20…. 11 12
chao_results_df %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "value", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 1) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "Species richness\n(Chao1)", x = "") +
  stat_summary(fun.y = mean, 
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "chao1_species_richness.pdf"), width = 5, height = 3)

7.1.3 Recon estimates for RSV-specific diversity

Following Recon's default settings and tutorial (see the Recon manual), the count files generated previously were used to create fit files, which were then used as input for generating the diversity_table.txt for all samples.

To generate the fit file for each count file, the following bash script was used in a for loop:

#!/bin/sh
set -euo pipefail
FILE=$1
python recon_v2.5.py -R -t 30 -c -o ${FILE}_fitfile.txt $FILE

python recon_v2.5.py -x --x_max 30 -o ${FILE}_plotfile.txt -b error_bar_parameters.txt ${FILE}_fitfile.txt

Then, each fit file was used for generating the diversity_table.txt with the command:

python recon_v2.5.py -v -D -b error_bar_parameters.txt -o output_diversity_table.txt *rsv_file_to_recon.txt_fitfile.txt

The results from diversity_table.txt for all the samples were used as input for plotting.

recon_res <- read.table("../data/diversity_index/recon/rsv_output_diversity_table.txt", header = TRUE) %>%
  filter(Timepoint != "PV", Timepoint != "Single-cell", Specificity != "PostF") %>%
  mutate(Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ),
   Group_specificity = paste(Specificity, Timepoint, Group, sep = "_")) %>%
  mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  ungroup()

write.csv(recon_res, paste0("../", result.dir, "recon_results_plot_values.csv"), row.names = FALSE)
  

stat.test <- recon_res %>%
              wilcox_test(formula = est_0.0D ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
est_0.0D Total_B1_20-mer Total_B1_1-mer 5 4 16 0.190 0.228 ns 275.7082 Total_B1…. 1 2
est_0.0D Total_B2_20-mer Total_B2_1-mer 5 4 16 0.190 0.228 ns 304.8120 Total_B2…. 3 4
est_0.0D PreF_B1_20-mer PreF_B1_1-mer 5 4 7 0.556 0.556 ns 333.9159 PreF_B1_…. 5 6
est_0.0D PreF_B2_20-mer PreF_B2_1-mer 5 4 2 0.064 0.127 ns 363.0197 PreF_B2_…. 7 8
est_0.0D DP_B1_20-mer DP_B1_1-mer 5 4 20 0.016 0.048 * 392.1236 DP_B1_20…. 9 10
est_0.0D DP_B2_20-mer DP_B2_1-mer 5 4 20 0.016 0.048 * 421.2274 DP_B2_20…. 11 12
recon_res %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "est_0.0D", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 1) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "Species richness\n(Recon)", x = "") +
  stat_summary(fun.y = mean, 
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "recon_species_richness.pdf"), width = 5, height = 3)

7.2 Somatic hypermutation comparisons

7.2.1 Somatic hypermutation over time for individualized database

Here the SHM is shown for all the sequences of every animal combined. The data is divided by 20-mer, 1-mer, and the specificity groups.

rds_indiv_total <- rds_merge %>%
  mutate(specificity_group = "Total") 

rds_indiv_plot <- rbind(rds_merge, rds_indiv_total)


rds_indiv_plot <- rds_indiv_plot %>%
  filter(specificity_group != "PostF") %>%
  mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_")) %>%
  mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>% ungroup() %>% drop_na(Group_specificity)

stat.test <- rds_indiv_plot %>%
              wilcox_test(formula = V_errors ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
V_errors Total_B1_20-mer Total_B1_1-mer 71438 41890 1096663994 0 0 **** 140.760 Total_B1…. 1 2
V_errors Total_B2_20-mer Total_B2_1-mer 141383 92712 4905379554 0 0 **** 151.272 Total_B2…. 3 4
V_errors PreF_B1_20-mer PreF_B1_1-mer 20774 21126 196698706 0 0 **** 161.784 PreF_B1_…. 5 6
V_errors PreF_B2_20-mer PreF_B2_1-mer 48561 47362 964010280 0 0 **** 172.296 PreF_B2_…. 7 8
V_errors DP_B1_20-mer DP_B1_1-mer 38256 8444 139047744 0 0 **** 182.808 DP_B1_20…. 9 10
V_errors DP_B2_20-mer DP_B2_1-mer 82239 33204 1055796386 0 0 **** 193.320 DP_B2_20…. 11 12
counts <- rds_indiv_plot %>% group_by(Group, specificity_group) %>% summarise(counts = n())

rds_indiv_plot %>%
  ggpubr::ggviolin(x = "Group_specificity", y = "V_errors", fill = "Group_specificity", group = "Group_specificity", 
                    legend = "none") +
  geom_boxplot(outlier.shape = NA, width = 0.15, color = "black", alpha = 0.2)+
  geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 80)) +
  scale_shape_manual(values=NA)+
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
  labs(y = "# IGHV nucleotide mutations",  x= "") +
#  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 70) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none") 

ggsave(paste0("../", result.dir, "rsv_specific_SHM_per_group.pdf"), width = 6, height = 3)

7.2.2 Plot somatic hypermutation over time summarized by animal

Here the SHM data is summarized per macaque, so each dot represents the average SHM of all the antigen-specific sequences for one animal.

rds_indiv_plot_summ <- rds_indiv_plot %>%
  filter(Group_specificity != "PostF") %>%
  group_by(ID, Group_specificity) %>%
  summarise(avg_V_errors = mean(V_errors, na.rm = TRUE),
            sd_V_errors = sd(V_errors, na.rm = TRUE),
            n = n()) %>%
  ungroup() %>%
  tidyr::drop_na()

stat.test <- rds_indiv_plot_summ %>%
              wilcox_test(formula = avg_V_errors ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
avg_V_errors Total_B1_20-mer Total_B1_1-mer 5 4 5 0.286 0.826 ns 25.37176 Total_B1…. 1 2
avg_V_errors Total_B2_20-mer Total_B2_1-mer 5 4 6 0.413 0.826 ns 26.81147 Total_B2…. 3 4
avg_V_errors PreF_B1_20-mer PreF_B1_1-mer 5 4 8 0.730 0.876 ns 28.25118 PreF_B1_…. 5 6
avg_V_errors PreF_B2_20-mer PreF_B2_1-mer 5 4 8 0.730 0.876 ns 29.69090 PreF_B2_…. 7 8
avg_V_errors DP_B1_20-mer DP_B1_1-mer 5 4 4 0.190 0.826 ns 31.13061 DP_B1_20…. 9 10
avg_V_errors DP_B2_20-mer DP_B2_1-mer 5 4 10 1.000 1.000 ns 32.57032 DP_B2_20…. 11 12
counts <- rds_indiv_plot_summ %>% group_by(Group_specificity) %>% summarise(max_n = max(n), min_n = min(n), total_n = sum(n))

rds_indiv_plot_summ %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "avg_V_errors", fill = "Group_specificity", group = "Group_specificity", 
                    legend = "none") +
  geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 25)) +
  scale_shape_manual(values=NA)+
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
  stat_summary(fun = mean, geom = "crossbar") +
  labs(y = "# IGHV nucleotide mutations",  x= "") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 22) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none") 

write.csv(rds_indiv_plot_summ, paste0("../", result.dir, "rsv_specific_mean_SHM_per_animal.csv"), row.names = FALSE)

ggsave(paste0("../", result.dir, "rsv_specific_mean_SHM_per_animal.pdf"), width = 6, height = 3)

7.3 Plot HCDR3 length over time for individualized database divided by 20-mer and 1-mer, and specificities

rds_indiv_hcdr3 <- rds_indiv %>%
 # filter(database == "Individualized", Timepoint == c("B1", "B2")) %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  filter(specificity_group != "PostF") %>%
  mutate(
    Group_specificity = paste(Group, specificity_group, sep = "_"),
    specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP")),
    Timepoint = factor(Timepoint, levels = c("B1", "B2"))
  ) %>% drop_na()
  
counts <- rds_indiv_hcdr3 %>% group_by(specificity_group, Timepoint, Group) %>% summarise(N=paste0('n = ',n()))
counts
## # A tibble: 12 × 4
## # Groups:   specificity_group, Timepoint [6]
##    specificity_group Timepoint Group  N      
##    <fct>             <fct>     <chr>  <chr>  
##  1 Total             B1        1-mer  n = 383
##  2 Total             B1        20-mer n = 651
##  3 Total             B2        1-mer  n = 682
##  4 Total             B2        20-mer n = 976
##  5 PreF              B1        1-mer  n = 205
##  6 PreF              B1        20-mer n = 205
##  7 PreF              B2        1-mer  n = 389
##  8 PreF              B2        20-mer n = 346
##  9 DP                B1        1-mer  n = 144
## 10 DP                B1        20-mer n = 413
## 11 DP                B2        1-mer  n = 250
## 12 DP                B2        20-mer n = 583
rds_indiv_hcdr3 %>%
  ggplot(aes(x = cdr3_aa_length, fill = Group)) +
  geom_density(alpha = .7) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_manual(values = c("#5F90B0", "#D896C1")) +
  labs(y = "Density", x = "HCDR3 aa length") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
 # geom_text(data = counts, aes(label = N, fill = NA), x = Inf, y = Inf, hjust = 1, vjust = 1, size = 4, color = "black") +
  facet_wrap(~ specificity_group + Timepoint, ncol = 3, dir = "v") 

ggsave(paste0("../", result.dir, "rsv_specific_CDR3_length.pdf"), width = 6, height = 3)

8 Comparison of discovered alleles (individualized database)

8.1 Plot V alleles unique and shared per animal, stacked plot

ls <- list.files("../data/databases/individualized", recursive = T, full.names = T, pattern = "V.fasta")
names(ls) <- basename(dirname(ls))
ls <- ls[-1] # remove combined V genes
individualized_dbs <- lapply(ls, Biostrings::readDNAStringSet)
size_indv_db <- data.frame(
  v_count = unlist(lapply(individualized_dbs, length)),
  type = "Shared"
) %>%
  tibble::rownames_to_column("ID")
individualized_dbs <- unlist(Biostrings::DNAStringSetList(individualized_dbs))
duplicated_names <- c(names(unique(individualized_dbs[length(individualized_dbs):1, ])), names(unique(individualized_dbs)))
individualized_dbs_sel <- individualized_dbs[setdiff(names(individualized_dbs), duplicated_names)]
size_unique_db <- data.frame(
  v_unique = table(substr(names(unique(individualized_dbs_sel)), 1, 3)),
  type = "Unique") %>%
  dplyr::rename(v_count = v_unique.Freq,
         ID = v_unique.Var1)


unique_v_indiv <- rbind(size_indv_db, size_unique_db) %>%
  add_row(ID = c("E11", "E24"), v_count = rep(0), type = rep("Unique")) %>%
  group_by(ID) %>%
  arrange(ID, .by_group = TRUE) %>%
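  # split each animal's total V-allele count into the "Unique" segment and the
  # remaining "Shared" segment so that the stacked bars sum to the total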
  mutate(
    diff = v_count - lag(v_count, default = last(v_count)),
    diff = ifelse(diff < 0, v_count, diff)
  )
unique_v_indiv %>%
  write.csv(paste0("../",result.dir, "alleles_unique_shared_per_animal.csv"), row.names = FALSE)

unique_v_indiv %>%
  mutate(type = factor(type, levels = c("Unique", "Shared"))) %>%
  ggplot(aes(x = ID, y = diff, fill = type)) +
  geom_col(color = "black") +
  scale_fill_viridis_d(direction = -1) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
  scale_x_discrete(expand = c(0, 0)) +
  labs(y = "V alelle counts", x = "Animal ID") +
  theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5))

ggsave(paste0("../", result.dir, "unique_and_shared_alleles.pdf"), width = 8, height = 6.38)

8.2 Plot V alleles validated or not per animal, stacked plot

kimdb <- Biostrings::readDNAStringSet("../data/databases/nestor_rm/V.fasta")

joined_dbs <- c(kimdb, individualized_dbs_sel)
uniq_joined_dbs <- unique(joined_dbs)
uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]
## DNAStringSet object of length 15:
##      width seq                                              names               
##  [1]   296 CAGGTCCAGCTGGTGCAATCCGG...CCGTGTATTACTGTGCAAGAGA E12.IGHV1-NL_1*01...
##  [2]   293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E12.IGHV3-100*01
##  [3]   293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
##  [4]   293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
##  [5]   298 GAAGTGCAGCTGGTGGAGTCTGG...TTGTATTACTGTAGTAGAGAGA E14.IGHV3-NL_15*0...
##  ...   ... ...
## [11]   296 GAGGTGCAGCTGGCGGAGTCTGG...CCGTGTATTACTGTGCGAGAGA E16.IGHV3-87*02
## [12]   299 CAGGTACAGCTGAAGGAGTCAGG...CCGTGTATTACTGTGCGAGACA E16.IGHV4-NL_23*0...
## [13]   296 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGAGA E18.IGHV4-NL_5*01...
## [14]   299 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGACA E18.IGHV4-NL_33*0...
## [15]   297 GAAGTGCAGCTGGTGGAGTCTGG...CTTGTATTACTGTGCAAAGATA E21.IGHV3-141*01
size_uniq_kimdb <- data.frame(
  v_unique = table(substr(names(uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]), 1, 3)),
  type = "Not validated"
) %>%
  dplyr::rename(
    v_count = v_unique.Freq,
    ID = v_unique.Var1
  )
size_indv_db$type <- "KIMDB"
kimdb_v_indiv <- rbind(size_indv_db, size_uniq_kimdb) %>%
  add_row(ID = c("E11", "E17", "E23", "E24"), v_count = rep(0), type = rep("Not validated")) %>%
  group_by(ID) %>%
  arrange(ID, .by_group = TRUE) %>%
  mutate(
    diff = v_count - lag(v_count, default = last(v_count)),
    diff = ifelse(diff < 0, v_count, diff)
  )

kimdb_v_indiv %>%
  write.csv(paste0("../",result.dir, "alleles_validated_KIMDB.csv"), row.names = FALSE)

kimdb_v_indiv %>%
  mutate(type = factor(type, levels = c("Not validated", "KIMDB"))) %>%
  ggplot(aes(x = ID, y = diff, fill = type)) +
  geom_col(color = "black") +
  scale_fill_viridis_d(direction = -1) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
  scale_x_discrete(expand = c(0, 0)) +
  labs(y = "V alelle counts", x = "Animal ID") +
  theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5))

ggsave(paste0("../", result.dir, "indiv_validated_KIMDB_alleles.pdf"), width = 8, height = 6.38)

9 Phylogenetic tree LOR21 and LOR24 monoclonal antibodies

9.1 Processing data

clonal_tree_data <- clonotype_rsv %>%
  ungroup() %>%
  distinct(new_name, .keep_all = TRUE) %>%
  group_by(specificity_group, sc_clone_grp, grp, ID_timepoint) %>%
  add_tally(name = "clonal_size") %>%
  ungroup()

mabs_of_interests <- c("LOR21" = "E16_05_A08", "LOR24" = "E16_05_H03", "LOR74" = "E11_05_B06")

# read data from heavy_chain
V_genes <- readDNAStringSet("../data/databases/individualized/combined/V.fasta")
J_genes <- readDNAStringSet("../data/databases/individualized/combined/J.fasta")
HC_all_filtered <- clonotype_rsv %>% filter(timepoint == "Single-cell")

# read data from light chain
LV_genes <- readDNAStringSet("../data/databases/cirelli_LC/V.fasta")
LJ_genes <- readDNAStringSet("../data/databases/cirelli_LC/J.fasta")

9.2 Generate fasta files for tree plotting

Select all the sequences that belong to the same clonotypes as the mAbs of interest, in this case LOR21, LOR24, and LOR74. The selected sequences are saved as fasta files for subsequent multiple sequence alignment and phylogenetic tree inference.

9.3 Heavy chain

sc_seq_count <- list()
for(i in seq_along(mabs_of_interests)){
  print(i)
  mab <- mabs_of_interests[i]
  
  if(any(stringr::str_count(names(mab), "LOR") <= 1)){
    mab <- mabs_of_interests[i]
    mab_name <- names(mab)
  }
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
    mab_name <- names(mab)}
  # create UCA heavy chain
  UCA_V <- V_genes[HC_all_filtered$V_gene[HC_all_filtered$name %in% mab]]
  UCA_J <- J_genes[HC_all_filtered$J_gene[HC_all_filtered$name %in% mab]]
  UCA <- DNAStringSet(paste0(UCA_V,UCA_J))
  names(UCA) <- paste0(mab_name,"_UCA")
  # select sequences part of the same clonotype
  group <- clonal_tree_data %>% filter(stringr::str_detect(sc_clone_grp, mab)) %>% select(grp) %>% unique() %>% pull()
  filtered <- clonal_tree_data %>%
    filter(grp == group) 
 
  # save csv with b cell lineage lineages info
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    write.csv(filtered, paste0("../",result.dir,mab_name[1],".csv"),
            row.names = FALSE)
  }else{
  write.csv(filtered, paste0("../",result.dir,mab_name,".csv"),
            row.names = FALSE)
  }
  # deduplicating sequences
  fasta <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
  
  fasta_lc <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
  
  #adding single cell sequences
  sc_seqs <- filtered[!duplicated(filtered[,c('sc_clone_grp','VDJ_nt')]),]
  
  #saving duplicated
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    sc_seq_count[[mab_name[1]]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }else{
    sc_seq_count[[mab_name]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }
    sc_seqs <- df_to_fasta(sequence_strings = sc_seqs$VDJ_nt, sequence_name = gsub(":", "_",sc_seqs$new_name), save_fasta = FALSE)
  
  fasta <- c(fasta, sc_seqs, UCA)
  fasta <- fasta[unique(names(fasta))]
  
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name[[1]], ".fasta"), append = FALSE, format = "fasta")
    }else{
    writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, ".fasta"), append = FALSE, format = "fasta") }
 
  
}
## [1] 1
## [1] 2
## [1] 3

9.4 Light chain

for(i in seq_along(mabs_of_interests)){
  print(i)
  mab <- mabs_of_interests[i]
  
  if(any(stringr::str_count(names(mab), "LOR") <= 1)){
    mab <- mabs_of_interests[i]
    mab_name <- names(mab)
  }
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
    mab_name <- names(mab)}
  # create UCA light chain
  UCA_LV <- LV_genes[clono_light_chains$v_call[clono_light_chains$name %in% mab]]
  UCA_LJ <- LJ_genes[clono_light_chains$j_call[clono_light_chains$name %in% mab]]
  UCA_LC <- DNAStringSet(paste0(UCA_LV,UCA_LJ))
  names(UCA_LC) <- paste0(mab_name,"_LC_UCA")
  # select light chain clonotypes
  group <- clono_light_chains %>% filter(stringr::str_detect(name, mab)) %>% select(grp) %>% unique() %>% pull()
  filtered <- clono_light_chains %>%
    filter(grp == group, substr(name,1,3) == substr(mab,1,3)) 
 
  # save csv with b cell lineage lineages info
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    write.csv(filtered, paste0("../",result.dir, mab_name[1],"_LC", ".csv"),
            row.names = FALSE)
  }else{
  write.csv(filtered, paste0("../",result.dir,mab_name,"_LC",".csv"),
            row.names = FALSE)
  }
  # save object as BStringset
  fasta <- df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE)
  
  # save duplicated values
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    sc_seq_count[[paste0(mab_name[1],"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }else{
    sc_seq_count[[paste0(mab_name,"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }
  
  fasta <- c(fasta, UCA_LC)
  fasta <- fasta[names(fasta)]
  
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    writeXStringSet(fasta, filepath = paste0("../",result.dir,  mab_name[[1]],"_LC", ".fasta"), append = FALSE, format = "fasta")
    }else{
    writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, "_LC", ".fasta"), append = FALSE, format = "fasta") }
}
## [1] 1
## [1] 2
## [1] 3

9.5 Run MUSCLE in bash

Run a multiple sequence alignment using MUSCLE (v 5.1) for all the fasta files generated throughout this analysis. Following that, run FastTree (v 2.1.11) on all the aligned sequences and save the tree output to be plotted with the following code.


source ~/.bash_profile
DIR_DATE=$(date +'%Y-%m-%d')
cd ../results/$DIR_DATE/    
for f in *.fasta; do muscle -align $f -output $f.aln; done

for f in *.aln; do fasttree -nt -gtr $f > $f.tre; done
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 217 seqs, avg length 364, max 370
## Input: 8 seqs, avg length 322, max 326
## Input: 118 seqs, avg length 370, max 370
## Input: 10 seqs, avg length 331, max 334
## Input: 71 seqs, avg length 370, max 370
## Input: 3 seqs, avg length 332, max 334
## 
## (per-alignment MUSCLE progress output omitted; six alignments completed)
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.06 seconds
## Refining topology: 31 rounds ME-NNIs, 2 rounds ME-SPRs, 16 rounds ML-NNIs
##       0.12 seconds: SPR round   1 of   2, 101 of 432 nodes
##       0.25 seconds: SPR round   1 of   2, 401 of 432 nodes
##       0.38 seconds: SPR round   2 of   2, 201 of 432 nodes
##       0.49 seconds: SPR round   2 of   2, 401 of 432 nodes
## Total branch-length 2.197 after 0.52 sec
##       0.64 seconds: ML NNI round 1 of 16, 101 of 215 splits, 27 changes (max delta 1.450)
## ML-NNI round 1: LogLk = -5720.973 NNIs 57 max delta 7.09 Time 0.74
##       0.85 seconds: Optimizing GTR model, step 2 of 12
##       0.95 seconds: Optimizing GTR model, step 3 of 12
##       1.08 seconds: Optimizing GTR model, step 5 of 12
##       1.21 seconds: Optimizing GTR model, step 8 of 12
##       1.32 seconds: Optimizing GTR model, step 10 of 12
## GTR Frequencies: 0.2176 0.2444 0.3116 0.2264
## GTR rates(ac ag at cg ct gt) 1.2448 1.4803 0.7188 0.5852 1.4823 1.0000
##       1.45 seconds: ML Lengths 1 of 215 splits
##       1.55 seconds: Site likelihoods with rate category 11 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.816 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
##       1.69 seconds: ML NNI round 2 of 16, 101 of 215 splits, 38 changes (max delta 0.001)
## ML-NNI round 2: LogLk = -5299.576 NNIs 67 max delta 0.00 Time 1.81
## Turning off heuristics for final round of ML NNIs (converged)
##       1.81 seconds: ML NNI round 3 of 16, 1 of 215 splits
##       1.92 seconds: ML NNI round 3 of 16, 101 of 215 splits, 40 changes (max delta 0.000)
##       2.03 seconds: ML NNI round 3 of 16, 201 of 215 splits, 57 changes (max delta 1.019)
## ML-NNI round 3: LogLk = -5298.548 NNIs 61 max delta 1.02 Time 2.06 (final)
## Optimize all lengths: LogLk = -5298.540 Time 2.12
##       2.26 seconds: ML split tests for    100 of    214 internal splits
##       2.41 seconds: ML split tests for    200 of    214 internal splits
## Total time: 2.44 seconds Unique: 217/217 Bad splits: 0/214
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 10 rounds ME-NNIs, 2 rounds ME-SPRs, 5 rounds ML-NNIs
## Total branch-length 0.080 after 0.00 sec
## ML-NNI round 1: LogLk = -612.010 NNIs 1 max delta 0.13 Time 0.00
## GTR Frequencies: 0.2670 0.2510 0.2459 0.2361
## GTR rates(ac ag at cg ct gt) 0.1273 14.1848 2.8987 3.9092 3.0704 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.642 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -587.982 NNIs 0 max delta 0.00 Time 0.02
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -587.981 NNIs 0 max delta 0.00 Time 0.02 (final)
## Optimize all lengths: LogLk = -587.981 Time 0.03
## Total time: 0.04 seconds Unique: 6/8 Bad splits: 0/3
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.04 seconds
## Refining topology: 28 rounds ME-NNIs, 2 rounds ME-SPRs, 14 rounds ML-NNIs
##       0.12 seconds: SPR round   1 of   2, 101 of 234 nodes
##       0.27 seconds: SPR round   2 of   2, 101 of 234 nodes
## Total branch-length 1.596 after 0.35 sec
##       0.38 seconds: ML Lengths 101 of 116 splits
##       0.51 seconds: ML NNI round 1 of 14, 101 of 116 splits, 29 changes (max delta 5.999)
## ML-NNI round 1: LogLk = -4057.963 NNIs 35 max delta 6.00 Time 0.53
##       0.62 seconds: Optimizing GTR model, step 4 of 12
##       0.74 seconds: Optimizing GTR model, step 8 of 12
##       0.85 seconds: Optimizing GTR model, step 11 of 12
## GTR Frequencies: 0.2038 0.2859 0.3002 0.2101
## GTR rates(ac ag at cg ct gt) 1.0761 0.9774 0.5841 0.4792 1.2071 1.0000
##       0.95 seconds: Site likelihoods with rate category 1 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.789 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
##       1.13 seconds: ML NNI round 2 of 14, 101 of 116 splits, 15 changes (max delta 0.290)
## ML-NNI round 2: LogLk = -3817.538 NNIs 18 max delta 1.05 Time 1.16
## ML-NNI round 3: LogLk = -3814.285 NNIs 1 max delta 3.25 Time 1.23
## ML-NNI round 4: LogLk = -3814.285 NNIs 0 max delta 0.00 Time 1.28
## Turning off heuristics for final round of ML NNIs (converged)
##       1.28 seconds: ML NNI round 5 of 14, 1 of 116 splits
##       1.42 seconds: ML NNI round 5 of 14, 101 of 116 splits, 0 changes
## ML-NNI round 5: LogLk = -3814.284 NNIs 0 max delta 0.00 Time 1.46 (final)
## Optimize all lengths: LogLk = -3814.284 Time 1.52
##       1.69 seconds: ML split tests for    100 of    115 internal splits
## Total time: 1.72 seconds Unique: 118/118 Bad splits: 0/115
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 13 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
## Total branch-length 0.104 after 0.01 sec
## ML-NNI round 1: LogLk = -699.140 NNIs 1 max delta 0.00 Time 0.01
## GTR Frequencies: 0.2026 0.3061 0.2592 0.2321
## GTR rates(ac ag at cg ct gt) 0.7128 4.0992 1.9319 0.9442 0.6308 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.648 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -675.314 NNIs 1 max delta 0.00 Time 0.04
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -675.314 NNIs 0 max delta 0.00 Time 0.05 (final)
## Optimize all lengths: LogLk = -675.314 Time 0.05
## Total time: 0.06 seconds Unique: 9/10 Bad splits: 0/6
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR74.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.01 seconds
## Refining topology: 25 rounds ME-NNIs, 2 rounds ME-SPRs, 12 rounds ML-NNIs
##       0.11 seconds: SPR round   2 of   2, 101 of 140 nodes
## Total branch-length 0.623 after 0.13 sec
## ML-NNI round 1: LogLk = -2035.235 NNIs 32 max delta 1.30 Time 0.19
##       0.22 seconds: Optimizing GTR model, step 3 of 12
##       0.32 seconds: Optimizing GTR model, step 11 of 12
## GTR Frequencies: 0.2003 0.2812 0.3012 0.2172
## GTR rates(ac ag at cg ct gt) 1.3353 1.7821 0.7868 0.4971 1.3891 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.772 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -1916.232 NNIs 8 max delta 0.00 Time 0.46
## Turning off heuristics for final round of ML NNIs (converged)
##       0.46 seconds: ML NNI round 3 of 12, 1 of 69 splits
## ML-NNI round 3: LogLk = -1916.232 NNIs 1 max delta 0.00 Time 0.54 (final)
## Optimize all lengths: LogLk = -1916.232 Time 0.56
## Total time: 0.66 seconds Unique: 71/71 Bad splits: 0/68
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR74_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 6 rounds ME-NNIs, 2 rounds ME-SPRs, 3 rounds ML-NNIs
## Total branch-length 0.053 after 0.00 sec
## ML-NNI round 1: LogLk = -560.888 NNIs 0 max delta 0.00 Time 0.00
## GTR Frequencies: 0.1960 0.3050 0.2620 0.2370
## GTR rates(ac ag at cg ct gt) 3.2025 7.6580 2.7938 2.4124 1.8012 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.635 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -548.791 NNIs 0 max delta 0.00 Time 0.00
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -548.791 NNIs 0 max delta 0.00 Time 0.00 (final)
## Optimize all lengths: LogLk = -548.791 Time 0.00
## Total time: 0.01 seconds Unique: 3/3 Bad splits: 0/0

9.6 Reading and plotting trees

Generated trees are edited using treeio and plotted with ggtree. The trees were rerooted to their respective Unmutated Common Ancestor (UCA), which in this case is defined simply as the V and J gene germlines concatenated. The gap between V and J is introduced automatically by the alignment method, so the CDR3 is not considered for the UCA.
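As a minimal toy illustration of the rerooting step (made-up tree, not from the data; assuming ape is available, which is loaded as a dependency of treeio), rooting on the tip whose label contains "UCA" simply makes that germline sequence the outgroup:

# toy example, not part of the analysis data
toy_tree <- ape::read.tree(text = "((seq1:0.10,seq2:0.20):0.05,(seq3:0.30,seq4_UCA:0.01):0.05);")
toy_rooted <- ape::root(toy_tree, which(grepl("UCA", toy_tree$tip.label)))
ape::is.rooted(toy_rooted)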

# function to root tree in UCA 
.to_root_uca <- function(x){
  root(x, which(grepl("UCA", x[["tip.label"]])))
}

ls <- list.files(paste0("../",result.dir), pattern = "*\\.tre", full.names = TRUE)

names(ls) <- lapply(ls, function(x) {
  if (stringr::str_count(x, "LOR") > 1) {
    substr(x, 24, 34)
  }else if (grepl("LC",x)) {
    substr(x, 24, 31)
  }else if (grepl("LOR",x)) {
    substr(x, 24, 28)
  }else{
    substr(x, 24,33)
    }})
trees <- lapply(ls, read.tree)
trees_rerooted <- lapply(trees, .to_root_uca)
plots <- lapply(trees_rerooted, ggtree)

for(i in seq_along(plots)){
  print(i)
  plot_name <- names(plots)[i]
  
  plots_edit <- plots[[i]]$data %>%
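    # relabel tree tips: map single-cell well IDs to LOR mAb names (lor_mabs)
    # and infer the timepoint from patterns in the sequence names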
    mutate(new_label = ifelse(grepl("sc_|LOR|E11_TR1-B2-igg_M05494_64_000000000-CRY9F_1_2115_14487_10780",label), gsub("sc_", "",label), ""),
           new_label = plyr::mapvalues(new_label,from = lor_mabs$well_ID, to = lor_mabs$LOR,
                                       warn_missing = FALSE),
           new_label = ifelse(grepl("LOR",new_label), new_label, ""),
           name = label,
           timepoint = case_when(grepl("B2-igg",name) ~ "B2",
                                 grepl("B1-igg",name) ~ "B1",
                                 grepl("sc_|LOR",name) ~ "single_cell",
                                 grepl("igm",name) ~ "PV",
                                 TRUE ~ "intersects"),
           name = gsub("sc_", "", label))
  
  
  if(stringr::str_count(plot_name, "LOR") > 1){
  plots_edit <- left_join(plots_edit, sc_seq_count[[paste0(plot_name,1)]], by = "name") 
  }else{ 
  plots_edit <- left_join(plots_edit, sc_seq_count[[plot_name]], by = "name") 
  }
 # shapes_timepoints <- c("PV" = 18, "B1" = 17, "B2" = 16, "single_cell" = 4)
  colors_timepoints <- c("PV" = "red", "intersects" = "black", "B1" = "#66c2a5", "B2" = "#fc8d62", "single_cell" = "#fc8d62")

  gg_plot <- ggtree(plots_edit,aes(color = timepoint)) + 
    {if(grepl("LC", plot_name))geom_tippoint(shape = 18, size = 1)
      else geom_tippoint(aes(size = duplicated), shape = 18)}+
   # geom_tippoint(aes(size = duplicated), shape = 23) +
    geom_tiplab(aes(label = new_label), hjust = -.2) +
    labs(size = "Count", shape = "Shape", color = "Timepoint") + 
    scale_color_manual(values = colors_timepoints) +
    coord_cartesian(clip = 'off') + 
   # scale_shape_manual(values = shapes_timepoints) +
    scale_size_area(limits = c(1,25), breaks = c(1,5,10,15,25), max_size = 3)+
    geom_treescale(width = .05)

  print(gg_plot)
  
   ggsave(gg_plot, filename = paste0("../", result.dir,  names(ls)[[i]], "_tree.pdf"), width = ifelse(grepl("LC", plot_name), 2, 4), height = ifelse(grepl("LC", plot_name), 2, 6))
  }
## [1] 1

## [1] 2

## [1] 3

## [1] 4

## [1] 5

## [1] 6

11 Profiling expressed monoclonal antibodies

11.1 Heatmap of competition between LOR mAbs

LOR mAbs showed diverse competition profiles against different standardized mAbs while binding to the PostF and PreF RSV fusion proteins.

data_comp_auc$EC50_preF <- log10(data_comp_auc$EC50_preF)
data_comp_auc$EC50_postF <- log10(data_comp_auc$EC50_postF)
data_comp_auc$IC50_RSV_A2 <- log10(data_comp_auc$IC50_RSV_A2)

# change here the columns you want to remove from the heatmap
to_remove <- c("EC50_postF","EC50_preF", "epitope", "specificity", "IC50_RSV_A2")
annot_colors <- list(specificity = c("PreF" = "#F7D586",
                                     "PreF/PostF" = "#CD87F8",
                                     "PostF" = "#92CDD6",
                                     "w.b." = "grey20"), epitope = c(
                                                    "Ø" = "#F3B084",
                                                    "V" = "salmon", 
                                                    "III" = "#A9D08D",
                                                    "IV" = "#FAD0FF",
                                                    "II" = "#FFD966",
                                                    "I" = "#9BC1E6",
                                                    "UK-Pre" = "#C9C9C9",
                                                    "UK-DP" = "#8497B0" ,
                                                    "WB" = "grey95",
                                                    "PostF" = "black",
                                                    "foldon" = "grey40"))

{
  g_heatmap_scale <- data_comp_auc %>%
    select(-all_of(to_remove)) %>%
    mutate(across(.cols = everything(), .fns = function(x) pmax(x,0))) %>%
    t() %>%
    pheatmap::pheatmap(scale = "none", angle_col = 90, cutree_cols = 8, 
                     clustering_method = "ward.D", 
                     color = viridisLite::viridis(100), 
                     cluster_rows = FALSE, border_color = "grey40", 
                     cluster_cols = TRUE, 
                     legend_breaks = c(0, 0.5, 1, 1.5, max(.)),
                     legend_labels = c("0", "0.5", "1", "1.5", "Normalized\ncompetition\n"),
                     annotation_col = data_comp_auc[,13:14], 
                     annotation_colors = annot_colors, 
                     cellwidth = 5, cellheight = 5, fontsize_row = 5, fontsize_col = 6, fontsize = 6)
 
  ggsave(g_heatmap_scale,filename = paste0("../", result.dir, "/", "LOR_heatmap_auc_wardD.pdf"), width = 18, height = 12, units = "cm")
}

11.2 Multidimensional scaling of competition between LOR mAbs

This MDS is another way to visualize the same data shown in the heatmap above. Here, the MDS input is the euclidean distance. The data were centered and scaled prior to calculating the euclidean distance, which is the simplest and most commonly used distance metric.
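A small self-contained illustration (made-up numbers, not the competition data) of why the values are centered and scaled first: without scaling, a variable with a large numeric range dominates the euclidean distance.

# hypothetical toy matrix: two variables on very different scales
m <- rbind(mAb_a = c(0.1, 500), mAb_b = c(0.2, 510), mAb_c = c(0.9, 505))
dist(m)        # distances dominated by the second (large-range) column
dist(scale(m)) # after centering and scaling, both columns contribute comparably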

11.2.1 Colored by epitope specificity

# check which mabs should be included and added
to_habillage <- factor(rownames(data_comp_auc), levels = c(paste0("LOR", sprintf(fmt = "%02d",seq(1:106)))))

data_comp_auc <- data_comp_auc %>% tibble::rownames_to_column("name") 
to_remove <- c("epitope", "specificity", "IC50_RSV_A2")

# compute MDS
mds_scaled <- data_comp_auc %>%
  select(-all_of(c(to_remove, "name"))) %>%
  scale(center = TRUE, scale = TRUE) %>%
  dist(method = "euclidean") %>%          
  cmdscale() %>%
  as_tibble()

colnames(mds_scaled) <- c("Dim.1", "Dim.2")
# Plot MDS
mds_scaled$name <- to_habillage
  
p1 <- mds_scaled %>%
  left_join(data_comp_auc) %>%
  ggplot(aes(Dim.1,Dim.2, label = name, color = epitope)) +
  geom_point(size = 2) +
  ggrepel::geom_text_repel(max.overlaps = 5) +
  scale_color_manual(values = annot_colors[["epitope"]]) +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5),
        aspect.ratio = 1) 

plotly::ggplotly(p1)
ggsave(plot = p1, width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-sites.pdf"))

11.2.2 Colored by RSV neutralization

p2 <- mds_scaled %>%
  left_join(data_comp_auc) %>%
  ggplot(aes(Dim.1,Dim.2, label = name, color = IC50_RSV_A2)) +
  geom_point(size = 2) +
  ggrepel::geom_text_repel(max.overlaps = 5) +
  scale_color_viridis_c() +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  labs(color = "IC50 RSV\n(log10)")+
  theme(axis.ticks = element_line(size = .5),
        aspect.ratio = 1) 

plotly::ggplotly(p2)
ggsave(plot = p2, width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-neuts.pdf"))

11.3 Processing and merging LOR metadata

mabs_lors <- mabs_lors %>%
  filter(name %in% lor_mabs$well_ID) %>%
  mutate(mAbs_ID = plyr::mapvalues(name, from = lor_mabs$well_ID, to = lor_mabs$LOR)) %>%
  arrange(mAbs_ID) %>%
  relocate(mAbs_ID)

mabs_clonal_rank <- rds_summary
for(i in mabs_lors$name) {
  mabs_clonal_rank[[i]] <- ifelse(grepl(i, rds_summary$sc_clone_grp), 
                                  mabs_lors[mabs_lors$name == i,]$mAbs_ID, 
                                  NA)
}

col_combine <- colnames(mabs_clonal_rank[15:length(mabs_clonal_rank)])         
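# collapse the per-mAb indicator columns into a single LOR column by taking
# the first non-NA value across them (coalesce)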
mabs_clonal_rank <- mabs_clonal_rank %>%  
  mutate(LOR = purrr::invoke(coalesce, across(all_of(col_combine)))) %>%
  select(LOR, colnames(mabs_clonal_rank)[! colnames(mabs_clonal_rank) %in% col_combine]) %>%
  filter(!is.na(LOR), database == "Individualized", Timepoint == "B2") %>%
  arrange(LOR) %>%
  mutate(well_ID = plyr::mapvalues(LOR, from = lor_mabs$LOR, to = substr(lor_mabs$well_ID, 1,3))) %>%
  filter(ID == well_ID)
  
mabs_lors <- mabs_lors %>% 
  mutate(clonal_rank_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size_rank),
         clonal_size_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size),
         clonal_group = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$grp)) %>%
  relocate(mAbs_ID, clonal_rank_B2, clonal_size_B2, clonal_group)

# Fixed manually LOR24 (same as LOR19), LOR37 (not detected B2, clonal size sc = 1), LOR40 (not detected B2, clonal size sc = 1 ), LOR42 (same clone as LOR01), LOR94 (same clone as LOR01), LOR96 (not detected B2, clonal size sc = 4)

#mabs_lors %>% write.csv(paste0("../data/single_cell/LOR_mAbs_info.csv"), row.names = F)
mabs_lors <- read.csv(file = paste0("../data/single_cell/LOR_mAbs_info.csv"), sep = ",")

mabs_lors %>%
  left_join(data_comp_auc %>% tibble::rowid_to_column("mAbs_ID") %>% mutate(mAbs_ID = as.character(name)), by = "mAbs_ID") %>%
ggplot(aes(x = clonal_rank_B2, y = clonal_size_B2, color = epitope)) +
  geom_point(size = 3) +
  scale_color_manual(values = annot_colors[["epitope"]]) +
  scale_y_log10() +
  scale_x_log10() +
  labs(fill = "Animal ID", x = "Clonal rank\n(log10)", y = "Clonal size\n(log10)") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5))

11.4 Heavy and light chain paired V genes

merged_mabs_lors <- mabs_lors %>%
  left_join(clonotype_rsv, by = "name") %>%
  left_join(clono_light_chains,suffix = c("_HC","_LC"), by = "name") %>%
  mutate(v_call_HC = V_gene.x) %>%
  fill(v_call_LC)

write.csv(merged_mabs_lors, file = "../data/single_cell/LOR_mAbs_info_full.csv", row.names = F)

merged_mabs_lors %>%
  group_by(v_call_HC, v_call_LC) %>%
  mutate(mabs_counts = n()) %>%
  ggplot(aes(x= v_call_HC, y = v_call_LC, fill = mabs_counts)) +
    geom_tile(color = "black") +
    scale_fill_viridis_c(option = "viridis", direction = 1) +
    scale_y_discrete(position = "right") +
    labs(fill = "mAb counts", x = "IGHV alleles", y = "IGHJ alleles") + 
    coord_equal() +
    ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
     theme(axis.text.x = element_text(angle = 90, size = 6, hjust = 1, 
                                     vjust = 0.5, face = "bold", colour = "black"),
          axis.text.y = element_text(face = "bold", colour = "black", size = 6),
          strip.text = element_text(face = "bold"),
          panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          legend.position =  "right",
          axis.ticks = element_line(size = .5),
          legend.title = element_text())

12 Take environment snapshot

renv::snapshot()

13 SessionInfo

sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS/LAPACK: /Users/rodrigoarcoverde/opt/anaconda3/envs/rsv_np_repertoire/lib/libopenblasp-r0.3.21.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4    dplyr_1.1.0         ggprism_1.0.4      
##  [4] scales_1.2.1        vegan_2.6-4         lattice_0.20-45    
##  [7] permute_0.9-7       data.table_1.14.8   Biostrings_2.66.0  
## [10] GenomeInfoDb_1.34.8 XVector_0.38.0      IRanges_2.32.0     
## [13] S4Vectors_0.36.0    BiocGenerics_0.44.0 ggpubr_0.6.0       
## [16] ggplot2_3.4.1       rstatix_0.7.2       ggtree_3.6.0       
## [19] treeio_1.22.0       tidyr_1.3.0         jsonlite_1.8.4     
## 
## loaded via a namespace (and not attached):
##  [1] ggVennDiagram_1.2.2    colorspace_2.1-0       ggsignif_0.6.4        
##  [4] ellipsis_0.3.2         class_7.3-21           aplot_0.1.10          
##  [7] rstudioapi_0.14        proxy_0.4-27           farver_2.1.1          
## [10] ggrepel_0.9.3          fansi_1.0.4            xml2_1.3.3            
## [13] splines_4.2.2          cachem_1.0.7           knitr_1.42            
## [16] broom_1.0.4            cluster_2.1.4          pheatmap_1.0.12       
## [19] compiler_4.2.2         httr_1.4.5             backports_1.4.1       
## [22] Matrix_1.5-3           fastmap_1.1.1          lazyeval_0.2.2        
## [25] cli_3.6.0              htmltools_0.5.4        tools_4.2.2           
## [28] gtable_0.3.1           glue_1.6.2             GenomeInfoDbData_1.2.9
## [31] Rcpp_1.0.10            carData_3.0-5          jquerylib_0.1.4       
## [34] vctrs_0.5.2            ape_5.7                svglite_2.1.1         
## [37] nlme_3.1-162           crosstalk_1.2.0        xfun_0.37             
## [40] stringr_1.5.0          rvest_1.0.3            lifecycle_1.0.3       
## [43] zlibbioc_1.44.0        MASS_7.3-58.3          parallel_4.2.2        
## [46] RColorBrewer_1.1-3     yaml_2.3.7             gridExtra_2.3         
## [49] ggfun_0.0.9            yulab.utils_0.0.6      sass_0.4.5            
## [52] stringi_1.7.12         highr_0.10             tidytree_0.4.2        
## [55] e1071_1.7-13           rlang_1.0.6            pkgconfig_2.0.3       
## [58] systemfonts_1.0.4      bitops_1.0-7           evaluate_0.20         
## [61] purrr_1.0.1            sf_1.0-7               patchwork_1.1.2       
## [64] htmlwidgets_1.6.1      labeling_0.4.2         cowplot_1.1.1         
## [67] tidyselect_1.2.0       plyr_1.8.8             magrittr_2.0.3        
## [70] R6_2.5.1               generics_0.1.3         DBI_1.1.3             
## [73] pillar_1.8.1           withr_2.5.0            mgcv_1.8-42           
## [76] units_0.8-1            abind_1.4-5            RCurl_1.98-1.10       
## [79] tibble_3.2.0           crayon_1.5.2           car_3.1-1             
## [82] KernSmooth_2.23-20     utf8_1.2.3             plotly_4.10.1         
## [85] RVenn_1.1.0            rmarkdown_2.20         grid_4.2.2            
## [88] digest_0.6.31          classInt_0.4-9         webshot_0.5.4         
## [91] gridGraphics_0.5-1     munsell_0.5.0          viridisLite_0.4.1     
## [94] ggplotify_0.1.0        bslib_0.4.2