
Discretizing Gene Expression Data by Clustering, in One Article

2019-12-02  柳叶刀与小鼠标

Problem 1: I have a gene expression matrix whose rows are samples and whose columns are genes.



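I want to convert each gene's expression level into categorical data. For example, suppose gene A's expression ranges from 1 to 100 across all samples, and cluster analysis shows that most samples express gene A at around 30. We can therefore use K-means to turn the gene expression matrix into two classes of labels: "30" and "not 30". In the script that follows, each gene is clustered separately into two groups, and the midpoint between the two cluster centers serves as the cut point.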
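To make the idea concrete, here is a minimal sketch for a single gene in plain Python. The expression values are made up, and `two_means_1d` is a hypothetical, simplified stand-in for scikit-learn's `KMeans` (a bare Lloyd iteration for two clusters on one-dimensional data), not the routine used in the script itself.

```python
# Hypothetical toy data: expression of one gene ("A") in 8 samples.
values = [28.0, 31.0, 29.5, 30.5, 30.0, 72.0, 85.0, 90.0]

def two_means_1d(xs, n_iter=100):
    """Simplified 2-cluster K-means (Lloyd's algorithm) on 1-D data.

    Returns the two cluster centers in ascending order.
    """
    lo, hi = min(xs), max(xs)              # start from the two extremes
    if lo == hi:
        raise ValueError("need at least two distinct values")
    for _ in range(n_iter):
        cut = (lo + hi) / 2                # each point joins the nearer center
        low = [x for x in xs if x <= cut]
        high = [x for x in xs if x > cut]
        new_lo = sum(low) / len(low)       # recompute the centers
        new_hi = sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):   # converged
            break
        lo, hi = new_lo, new_hi
    return lo, hi

lo_center, hi_center = two_means_1d(values)
threshold = (lo_center + hi_center) / 2    # midpoint between the two centers
labels = ["A0" if v <= threshold else "A1" for v in values]

print(lo_center, hi_center, threshold)
print(labels)
```

With these toy values, the five samples near 30 fall into the low group (`A0`) and the three remaining samples into the high group (`A1`).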

Today's approach in Python, using pandas and scikit-learn:

# -*- coding: utf-8 -*-
"""
Created on Mon Dec  2 00:32:59 2019

@author: czh
"""


# In[*]
%reset -f
%clear
# In[*]
import pandas as pd
from sklearn.cluster import KMeans  # import the K-means clustering algorithm
import os
os.chdir("D:\\train\\diff")

# In[*]
data = pd.read_csv("5year.csv",header=0,index_col=0)

# Column 0 is the Class label; columns 1-77 hold the gene expression values.
d = data.iloc[:,1:78]
# In[*]
data.head()

# In[*]
d.head()


# In[*]
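# Make sure the column names are strings, then append a suffix
# so that the bin labels built from them are easy to recognize.
d.columns = d.columns.map(lambda x :str(x))

d.columns = d.columns+ "gene_exp"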

d.columns
# In[*]
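def f(x):
    # Cluster one gene's expression values into two groups with K-means.
    model = KMeans(n_clusters=2)
    model.fit(d[[x]].to_numpy())  # .as_matrix() was removed in pandas 1.0

    # Sort the two cluster centers; their midpoint is the cut point.
    centers_d = pd.DataFrame(model.cluster_centers_).sort_values(by = 0)
    group = [0] + list(centers_d.rolling(2).mean().iloc[1:][0]) + [d[x].max()]
    # include_lowest=True keeps values equal to 0 in the first bin;
    # by default pd.cut excludes the left edge, which turns them into NaN.
    s = pd.cut(d[x], group, labels = [ x + str(i) for i in range(2)], include_lowest = True)

    return s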
    

# In[*]
# Discretize every gene column and join the results side by side.
aprioriData = pd.concat([f(col) for col in d.columns], axis=1)

# In[*]
discretization_d =  pd.concat([aprioriData,data['Class']],axis =1)


# In[*]
discretization_d.head()

Output of data.head() and d.head() in the IPython console:

data.head()
Out[50]: 
      Class    ADGRA2    ANGPTL2  ...     TPST1    TSC22D3     VSTM4
id                                ...                               
AA80      0  0.776205   3.942062  ...  7.347908  10.512511  0.209625
A9TC      0  2.857827   3.229691  ...  2.324581   7.113074  0.485731
A5W6      0  1.161271   5.802349  ...  7.360124  21.058854  0.629902
A6DX      0  1.465745   7.821838  ...  6.256095  29.304477  1.290819
A8HH      0  9.702574  18.361627  ...  6.382861  29.900405  1.442875

[5 rows x 78 columns]
d.head()
Out[51]: 
      ADGRA2gene_exp  ANGPTL2gene_exp  ...  TSC22D3gene_exp  VSTM4gene_exp
id                                     ...                                
AA80        0.776205         3.942062  ...        10.512511       0.209625
A9TC        2.857827         3.229691  ...         7.113074       0.485731
A5W6        1.161271         5.802349  ...        21.058854       0.629902
A6DX        1.465745         7.821838  ...        29.304477       1.290819
A8HH        9.702574        18.361627  ...        29.900405       1.442875

[5 rows x 77 columns]
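Since `5year.csv` is not included here, the following is a small self-contained sketch of the same workflow on made-up data, so the result can be inspected without the original file. The `toy` table, the gene names `GENE_A` and `GENE_B`, and the helper `binarize` are all hypothetical. The sketch also swaps `pd.cut` for a plain threshold comparison, which should give an equivalent two-way split, and fixes `random_state` so repeated runs agree.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for 5year.csv: 6 samples, 2 genes, plus a Class column.
toy = pd.DataFrame(
    {
        "Class": [0, 0, 0, 1, 1, 1],
        "GENE_A": [1.0, 1.2, 0.8, 9.0, 10.0, 11.0],
        "GENE_B": [0.0, 5.0, 5.5, 6.0, 0.5, 0.2],
    },
    index=["S1", "S2", "S3", "S4", "S5", "S6"],
)
expr = toy.drop(columns="Class")

def binarize(col):
    """Split one gene at the midpoint between its two K-means centers."""
    model = KMeans(n_clusters=2, n_init=10, random_state=0)
    model.fit(col.to_numpy().reshape(-1, 1))
    threshold = model.cluster_centers_.ravel().mean()
    # Label values above the midpoint "<gene>1", the rest "<gene>0".
    return pd.Series(
        np.where(col > threshold, col.name + "1", col.name + "0"),
        index=col.index,
        name=col.name,
    )

# Discretize every gene, then attach the Class column again.
binned = pd.concat([expr.apply(binarize), toy["Class"]], axis=1)
print(binned)

# Quick sanity check: how many samples landed in each bin per gene.
for gene in expr.columns:
    print(binned[gene].value_counts().to_dict())
```

Counting the labels per gene, as in the last loop, is a cheap way to spot genes where almost every sample ends up in the same bin.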