A One-Article Guide to Cluster-Based Discretization of Gene Expression Data
2019-12-02
柳叶刀与小鼠标
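Question 1: I have a gene expression matrix in which rows are samples and columns are genes.
I want to convert each gene's expression values into categorical data. For example, suppose gene A ranges from 1 to 100 across all samples, and cluster analysis shows that most samples express gene A at around 30. We can therefore use k-means to convert the expression matrix into two labels per gene: "30" and "non-30". In practice this means each gene gets one label for its low-expression cluster and one for its high-expression cluster.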
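The idea can be sketched on a single gene without any libraries. The block below is a minimal illustration, not scikit-learn's implementation: it assumes a plain Lloyd's-algorithm loop with two centers, and the function names `two_means_1d` and `binarize` as well as the expression values are hypothetical.

```python
def two_means_1d(values, n_iter=100):
    """Split 1-D values into two clusters with Lloyd's algorithm.

    Returns the two cluster centers in ascending order.
    """
    lo, hi = min(values), max(values)  # start with the centers at the extremes
    for _ in range(n_iter):
        mid = (lo + hi) / 2
        low_group = [v for v in values if v <= mid]
        high_group = [v for v in values if v > mid]
        if not low_group or not high_group:
            break  # all values fall on one side; nothing left to split
        new_lo = sum(low_group) / len(low_group)
        new_hi = sum(high_group) / len(high_group)
        if new_lo == lo and new_hi == hi:
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return lo, hi


def binarize(values, gene):
    """Label each value '<gene>0' (low cluster) or '<gene>1' (high cluster)."""
    lo, hi = two_means_1d(values)
    threshold = (lo + hi) / 2  # midpoint between the two centers
    return [gene + ("0" if v <= threshold else "1") for v in values]


# Hypothetical expression values of one gene across eight samples
expr = [28.0, 31.0, 29.5, 30.5, 32.0, 85.0, 90.0, 5.0]
labels = binarize(expr, "A")
print(labels)
```

With only two clusters the split is low versus high, so the sample at 5 ends up in the same group as the samples near 30. Separating "around 30" from everything else would likely need more than two clusters.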
The full script, written in Python, follows:
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 2 00:32:59 2019
@author: czh
"""
# In[*]
# IPython magics (Spyder/Jupyter only): clear all variables and the console
%reset -f
%clear
# In[*]
import pandas as pd
from sklearn.cluster import KMeans  # K-means clustering
import os
os.chdir("D:\\train\\diff")
# In[*]
# Column 0 is the Class label; columns 1 to 77 hold the gene expression values
data = pd.read_csv("5year.csv", header=0, index_col=0)
d = data.iloc[:, 1:78]
# In[*]
data.head()
# In[*]
d.head()
# In[*]
d.columns = d.columns.map(lambda x :str(x))
d.columns = d.columns+ "gene_exp"
d.columns
# In[*]
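def f(x):
    """Discretize column x of d into two labels using 1-D k-means."""
    model = KMeans(n_clusters=2)
    # as_matrix() was removed in pandas 1.0; to_numpy() returns the same 2-D array
    model.fit(d[[x]].to_numpy())
    # Sort the two cluster centers in ascending order
    centers_d = pd.DataFrame(model.cluster_centers_).sort_values(by=0)
    # Bin edges: 0, the midpoint between the two centers, and the column maximum
    group = [0] + list(centers_d.rolling(2).mean().iloc[1:][0]) + [d[x].max()]
    # include_lowest=True keeps samples whose expression is exactly 0 in the first bin
    s = pd.cut(d[x], group, labels=[x + str(i) for i in range(2)], include_lowest=True)
    return s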
# In[*]
aprioriData = pd.DataFrame()
# Discretize each gene column and append the resulting labels column by column
for col in d.columns:
    Data = f(col)
    aprioriData = pd.concat([aprioriData, Data], axis=1)
# In[*]
discretization_d = pd.concat([aprioriData,data['Class']],axis =1)
# In[*]
discretization_d.head()
Console output:
data.head()
Out[50]:
      Class    ADGRA2    ANGPTL2  ...     TPST1    TSC22D3     VSTM4
id                                ...
AA80      0  0.776205   3.942062  ...  7.347908  10.512511  0.209625
A9TC      0  2.857827   3.229691  ...  2.324581   7.113074  0.485731
A5W6      0  1.161271   5.802349  ...  7.360124  21.058854  0.629902
A6DX      0  1.465745   7.821838  ...  6.256095  29.304477  1.290819
A8HH      0  9.702574  18.361627  ...  6.382861  29.900405  1.442875
[5 rows x 78 columns]
d.head()
Out[51]:
      ADGRA2gene_exp  ANGPTL2gene_exp  ...  TSC22D3gene_exp  VSTM4gene_exp
id                                     ...
AA80        0.776205         3.942062  ...        10.512511       0.209625
A9TC        2.857827         3.229691  ...         7.113074       0.485731
A5W6        1.161271         5.802349  ...        21.058854       0.629902
A6DX        1.465745         7.821838  ...        29.304477       1.290819
A8HH        9.702574        18.361627  ...        29.900405       1.442875
[5 rows x 77 columns]
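Since 5year.csv is not included with this post, the whole pipeline can be tried on a small synthetic table instead. The following is a rough sketch under stated assumptions: the gene names, sample ids and values are made up, `discretize_column` is a hypothetical helper, and it thresholds directly at the midpoint of the two cluster centers rather than calling pd.cut as the script above does. For non-negative expression values the two approaches should give the same split.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def discretize_column(col):
    """Split one numeric column into two labels at the midpoint of its k-means centers."""
    model = KMeans(n_clusters=2, n_init=10, random_state=0)
    model.fit(col.to_numpy().reshape(-1, 1))
    threshold = model.cluster_centers_.ravel().mean()  # midpoint of the two centers
    labels = np.where(col.to_numpy() <= threshold, col.name + "0", col.name + "1")
    return pd.Series(labels, index=col.index, name=col.name)


# Synthetic stand-in for the expression matrix (hypothetical values, not from 5year.csv)
toy = pd.DataFrame(
    {
        "Class": [0, 0, 0, 1, 1, 1],
        "GENE_A": [1.0, 1.2, 0.9, 9.8, 10.1, 10.4],
        "GENE_B": [20.0, 3.1, 2.9, 21.5, 3.0, 19.8],
    },
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)

# Same steps as the script: rename gene columns, discretize each one, re-attach Class
genes = toy.drop(columns="Class").add_suffix("gene_exp")
discretized = pd.concat([genes.apply(discretize_column), toy["Class"]], axis=1)
print(discretized)
```

Using DataFrame.apply avoids growing a DataFrame inside a loop, and fixing random_state makes the clustering reproducible between runs.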