Script to identify regions of differential selection across the HIV or SHIV envelope gene ectodomain in longitudinal single genome sequencing data. While these scripts were written for this specific use case, it would also be applicable to similar situations with some custom modifications. For questions please contact Michael P. Hogarty (michael.hogarty@pennmedicine.upenn.edu).
- Familiarity with the command-line interface and/or R
- R https://www.r-project.org/ version 4.3.1
- seqinr version 4.2-36
- dplyr version 1.1.4
- ggplot2 version 3.5.1
- MPH_extract.R
- Takes Env sequencing alignments as input and extracts regions of interest using HXB2 numbering
- MPH_run.R
- Takes output from MPH_extract.R and calculates hamming distance between each sequence and a reference sequence
- MPH_summary.R
- Takes output from MPH_run.R and calculates p values for each region of interest
- MPH_moving_region_multi.sh
- Bash script to call other scripts for every continous region of Env of a given length
- Also enables multiprocessing
Configure the variables in MPH_moving_region_multi.sh before running:
- project_folder: path to output directory (does not have to exist before running)
- macaques_file: list of individuals with available sequencing data to analyze, expected input is a text file with one individual per line
- group_file: list of individuals to compare selection in, expected input is a text file with one individual per line
- alignments_folder: directory containing fasta alignments. Expected input is a single folder with a file for each individual in macaques_file named with the individual's name and "_hxb2.fasta" appended. The Env reference sequence from HXB2 should be included in each alignment. The first sequence in the alignment will be used as the reference sequence from which to idenify divergence over time, and it is suggested to use the transmitted/founder Env as a reference
- region_size: length of region to analyze. Use one less than desired total length (i.e. enter "9" for a moving region of 10 amino acids)
- cores: number of cores for multiprocessing
Call script:
$ bash MPH_moving_region_multi.sh
Visualize data by running MPH_summary.R
If you use this script please cite: Habib R, Roark RS, Li H, Connell AJ, Hogarty MP, Wagh K, Wang S, Marchitto L, Skelly AN, Carey JW, Sowers KJ, Ayyanathan K, Plante SJ, Bibollet-Ruche F, Park Y, Agostino CJ, Singh A, Martella CL, Lewis E, Rando JM, Chohan N, Lora J, Ding W, Campion MS, Zhao C, Liu W, Li Y, Li X, Liang B, Chowdhury RR, Amereh K, Van Itallie E, Sheng Z, Ghosh AR, Bar KJ, Williams WB, Wiehe K, Saunders KO, Edwards RJ, Cain DW, Lewis M, Batista FD, Burton DR, Andrabi R, Kulp DW, Haynes BF, Korber B, Shapiro L, Kwong PD, Hahn BH, Shaw GM. Env-antibody coevolution identifies B cell priming as the principal bottleneck to HIV-1 V2 apex broadly neutralizing antibody development. bioRxiv [Preprint]. 2025 Sep 19:2025.05.03.652068. doi: 10.1101/2025.05.03.652068. PMID: 40502159; PMCID: PMC12154871.