Задание 1.¶
Реализуем алгоритм выделения структурных доменов DOMAK. Контакты между остатками посчитаем с помощью arpeggio.
Необходимые импорты и установка arpeggio.
!pip install -q condacolab
import condacolab
condacolab.install()
!conda install -c conda-forge gemmi openbabel biopython
!pip install pdbe-arpeggio
!pdbe-arpeggio -h
usage: pdbe-arpeggio [-h] [-s SELECTION [SELECTION ...] | -sf SELECTION_FILE] [-wh] [-mh]
[-ms MINIMISATION_STEPS] [-mf {MMFF94,UFF,Ghemical}]
[-mm {DistanceGeometry,SteepestDescent,ConjugateGradients}] [-co VDW_COMP]
[-i INTERACTING] [-ph PH] [-sa] [-a] [-o OUTPUT] [-m]
filename
############
# ARPEGGIO #
############
A library for calculating interatomic contacts in proteins
Dependencies:
- Python >= 3.6
- Numpy
- BioPython >= v1.80
- Open Babel >= 3.0.0
- gemmi >= 0.5.8
positional arguments:
filename Path to the file to be analysed.
options:
-h, --help show this help message and exit
-s SELECTION [SELECTION ...], --selection SELECTION [SELECTION ...]
Select the "ligand" for interactions, using selection syntax: /<chain_id>/<res_num>[<ins_code>]/<atom_name> or RESNAME:<het_id>. Fields can be omitted.
-sf SELECTION_FILE, --selection-file SELECTION_FILE
Selections as above, but listed in a file.
-wh, --write-hydrogenated
Write an output file including the added hydrogen coordinates.
-mh, --minimise-hydrogens
Energy minimise OpenBabel added hydrogens.
-ms MINIMISATION_STEPS, --minimisation-steps MINIMISATION_STEPS
Number of hydrogen minimisation steps to perform.
-mf {MMFF94,UFF,Ghemical}, --minimisation-forcefield {MMFF94,UFF,Ghemical}
Choose the forcefield to minimise hydrogens with. Ghemical is not recommended.
-mm {DistanceGeometry,SteepestDescent,ConjugateGradients}, --minimisation-method {DistanceGeometry,SteepestDescent,ConjugateGradients}
Choose the method to minimise hydrogens with. ConjugateGradients is recommended.
-co VDW_COMP, --vdw-comp VDW_COMP
Compensation factor for VdW radii dependent interaction types.
-i INTERACTING, --interacting INTERACTING
Distance cutoff for grid points to be 'interacting' with the entity.
-ph PH pH for hydrogen addition.
-sa, --include-sequence-adjacent
For intra-polypeptide interactions, include non-bonding interactions between residues that are next to each other in sequence; this is not done by default.
-a, --use-ambiguities
Turn on abiguous definitions for ambiguous contacts.
-o OUTPUT, --output OUTPUT
Define output directory.
-m, --mute Silent mode without debug info.
import json
import pandas as pd
Arpeggio работает с файлами в формате cif.
!wget http://www.rcsb.org/pdb/files/7DGD.cif.gz
!gzip -d 7DGD.cif.gz
!ls
--2023-12-13 16:59:13-- http://www.rcsb.org/pdb/files/7DGD.cif.gz Resolving www.rcsb.org (www.rcsb.org)... 128.6.159.248 Connecting to www.rcsb.org (www.rcsb.org)|128.6.159.248|:80... connected. HTTP request sent, awaiting response... 301 Moved Permanently Location: https://www.rcsb.org/pdb/files/7DGD.cif.gz [following] --2023-12-13 16:59:13-- https://www.rcsb.org/pdb/files/7DGD.cif.gz Connecting to www.rcsb.org (www.rcsb.org)|128.6.159.248|:443... connected. HTTP request sent, awaiting response... 301 Moved Permanently Location: https://files.rcsb.org/download/7DGD.cif.gz [following] --2023-12-13 16:59:13-- https://files.rcsb.org/download/7DGD.cif.gz Resolving files.rcsb.org (files.rcsb.org)... 128.6.159.157 Connecting to files.rcsb.org (files.rcsb.org)|128.6.159.157|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 280261 (274K) [application/octet-stream] Saving to: ‘7DGD.cif.gz’ 7DGD.cif.gz 100%[===================>] 273.69K --.-KB/s in 0.07s 2023-12-13 16:59:13 (3.61 MB/s) - ‘7DGD.cif.gz’ saved [280261/280261] 7DGD.cif condacolab_install.log sample_data
!mkdir ./arp-results3
По умолчанию ищутся контакты внутри всей структуры. При обозначении одной цепи как лиганда (в данном случае цепи А, если в структуре всего одна цепь уберите флаг -s) у записей выдачи появляются метки interacting_entities, см выдачу ниже.
!pdbe-arpeggio 7DGD.cif -ph 7.0 -wh -o ./arp-results3 -s /A//
!ls ./arp-results3
7DGD_hydrogenated.mmcif 7DGD.json
Обратите внимание на типы взаимодействующих объектов (внутри выборки, между выборкой и чем-то ещё и тд)
Выдача в виде json-файла, воспользуемся пакетом json для парсинга и переведём в более человеко-читаемый вид таблицы. Также отфильтруем контакты нужного нам типа между выборкой (цепью А в данном случае) и остальной структурой (INTER)
with open('./arp-results3/7DGD.json', 'r') as js_file:
contacts = json.load(js_file)
output_dict = [x for x in contacts if (any(y in ['proximal','vdw','vdw_clash','clash'] for y in x['contact']) and (x['interacting_entities'] == 'INTRA_SELECTION') ) ] #
df = pd.json_normalize(output_dict)
df
| contact | distance | interacting_entities | type | bgn.auth_asym_id | bgn.auth_atom_id | bgn.auth_seq_id | bgn.label_comp_id | bgn.label_comp_type | bgn.pdbx_PDB_ins_code | end.auth_asym_id | end.auth_atom_id | end.auth_seq_id | end.label_comp_id | end.label_comp_type | end.pdbx_PDB_ins_code | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [proximal] | 4.88 | INTRA_SELECTION | atom-atom | A | O | 695 | THR | P | A | CB | 697 | LYS | P | ||
| 1 | [proximal] | 4.97 | INTRA_SELECTION | atom-atom | A | CA | 697 | LYS | P | A | O | 695 | THR | P | ||
| 2 | [proximal] | 4.45 | INTRA_SELECTION | atom-atom | A | CA | 694 | CYS | P | A | O | 692 | LYS | P | ||
| 3 | [proximal] | 4.79 | INTRA_SELECTION | atom-atom | A | O | 692 | LYS | P | A | CA | 688 | GLY | P | ||
| 4 | [proximal] | 4.33 | INTRA_SELECTION | atom-atom | A | N | 698 | PRO | P | A | CG | 696 | ARG | P | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26962 | [proximal] | 4.78 | INTRA_SELECTION | atom-atom | A | CE2 | 418 | TYR | P | A | NZ | 88 | LYS | P | ||
| 26963 | [proximal] | 4.25 | INTRA_SELECTION | atom-atom | A | NZ | 88 | LYS | P | A | CZ | 418 | TYR | P | ||
| 26964 | [proximal, polar] | 3.34 | INTRA_SELECTION | atom-atom | A | OH | 418 | TYR | P | A | NZ | 88 | LYS | P | ||
| 26965 | [proximal] | 4.67 | INTRA_SELECTION | atom-atom | A | CG | 88 | LYS | P | A | CD2 | 84 | HIS | P | ||
| 26966 | [proximal] | 4.76 | INTRA_SELECTION | atom-atom | A | CG | 88 | LYS | P | A | NE2 | 84 | HIS | P |
26967 rows × 16 columns
Отфильтруем контакты только белок-белок, а также оставим только по 1 контакту между парой остатков.
df_protein_only = df.loc[(df['bgn.label_comp_type'] == 'P') & (df['end.label_comp_type'] == 'P')]
df_droped = df_protein_only.drop_duplicates(subset=['bgn.auth_seq_id', 'end.auth_seq_id'], keep='first')
df_droped.sort_values(by=['distance'])
| contact | distance | interacting_entities | type | bgn.auth_asym_id | bgn.auth_atom_id | bgn.auth_seq_id | bgn.label_comp_id | bgn.label_comp_type | bgn.pdbx_PDB_ins_code | end.auth_asym_id | end.auth_atom_id | end.auth_seq_id | end.label_comp_id | end.label_comp_type | end.pdbx_PDB_ins_code | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11897 | [vdw_clash] | 2.72 | INTRA_SELECTION | atom-atom | A | O | 537 | ARG | P | A | O | 541 | VAL | P | ||
| 6593 | [vdw_clash, polar] | 2.73 | INTRA_SELECTION | atom-atom | A | O | 664 | VAL | P | A | OG | 668 | SER | P | ||
| 20652 | [vdw_clash, polar] | 2.73 | INTRA_SELECTION | atom-atom | A | OG | 344 | SER | P | A | O | 408 | SER | P | ||
| 384 | [vdw_clash, polar] | 2.77 | INTRA_SELECTION | atom-atom | A | N | 680 | ASN | P | A | O | 676 | VAL | P | ||
| 12486 | [vdw_clash] | 2.77 | INTRA_SELECTION | atom-atom | A | O | 560 | PHE | P | A | O | 535 | VAL | P | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12760 | [proximal] | 5.00 | INTRA_SELECTION | atom-atom | A | CB | 550 | CYS | P | A | CA | 562 | CYS | P | ||
| 7218 | [proximal] | 5.00 | INTRA_SELECTION | atom-atom | A | CG | 651 | LYS | P | A | CZ | 585 | TYR | P | ||
| 5182 | [proximal] | 5.00 | INTRA_SELECTION | atom-atom | A | N | 722 | VAL | P | A | CB | 725 | ILE | P | ||
| 4814 | [proximal] | 5.00 | INTRA_SELECTION | atom-atom | A | O | 192 | LEU | P | A | O | 194 | ASP | P | ||
| 18242 | [proximal] | 5.00 | INTRA_SELECTION | atom-atom | A | C | 813 | ILE | P | A | C | 816 | CYS | P |
5580 rows × 16 columns
Напишите функцию расчета split_value по номеру остатка.
SplitValue=(intA/extAB)∙(intB/extAB), где intA – число пар контактирующих остатков из A, intB – число пар контактирующих остатков из B, extAB – число пар контактирующих остатков, один из A, а другой – из B.
def split_value(split_pos, df):
intA = len(df[(df["bgn.auth_seq_id"] <= split_pos) & (df["end.auth_seq_id"] <= split_pos)])
intB = len(df[(df["bgn.auth_seq_id"] > split_pos) & (df["end.auth_seq_id"] > split_pos)])
extAB = len(df[((df["bgn.auth_seq_id"] > split_pos) & (df["end.auth_seq_id"] <= split_pos)) | ((df["bgn.auth_seq_id"] <= split_pos) & (df["end.auth_seq_id"] > split_pos))])
if extAB != 0:
return (intA/extAB)*(intB/extAB)
else:
return 0
Вычислите split_value для каждого остатка в вашем белке. Постройте график split_value в зависимости от номера остатка.
split_data = [split_value(i+1, df_droped) for i in range(833)] #833 - длина цепи А
from matplotlib import pyplot as plt
import seaborn as sns
plt.figure(1)
sns.set_style("darkgrid")
plt.plot(range(1, 834), split_data, color='#76ee00')
plt.xlabel("Residues")
plt.ylabel("SplitValue")
Text(0, 0.5, 'SplitValue')
Рассмотрите пики split_value в структуре вашего белка. Сделайте иллюстрацию в PyMOL, показывающую разделение вашей структуры на домены по split_value.
# приблизим график и найдём пики
plt.figure(1)
sns.set_style("darkgrid")
plt.plot(range(1, 834), split_data, color='#ee145b')
plt.xlim(510, 600)
plt.show()
first_peak = split_data.index(max(split_data[520:530]))
second_peak = split_data.index(max(split_data[570:590]))
print(f'Первый пик: {first_peak}. Второй пик: {second_peak}')
Первый пик: 522. Второй пик: 580
from IPython.display import Image
Image("domains.png", width=1000)
Рис.1. Разметка доменов после DOMAK и вычисления SplitValue (покраска по доменам): 31-522 - бирюзовый, 522-580 - сиреневый, 580-863 - лаймовый (ссылка на сессию).