MachineLearning.1.How to use ML for stock investing.D

2017-03-23  小异_Summer

References:
os.walk() file-name ordering: python pitfall -- file-name ordering of the os module differs across platforms

Continued from the previous post.

4. Parsing data

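Unzip the archive and place it under a suitable path. In PyCharm, create a new Project and select the Python interpreter that ships with Anaconda. The script and its output are as follows: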

import pandas as pd
import os
import time
from datetime import datetime

path = "/home/sum/share/Ubuntu_DeepLearning/intraQuarter" #cd path & pwd

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    stock_list = sorted([x[0] for x in os.walk(statspath)]) 
    #in Linux use sorted() func
    #print(stock_list)

    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        #print(each_file)
        #time.sleep(15)
        if len(each_file) > 0:
            for file in each_file:
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                print(date_stamp, unix_time)
                #time.sleep(15)

Key_Stats()
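
The two steps the script above relies on, sorting the directories returned by os.walk() and turning a snapshot file name into a timestamp, can be tried in isolation. The following is only a minimal sketch: the helper names list_ticker_dirs and parse_snapshot_name and the ticker folders created in the temporary directory are made up for illustration and are not part of the tutorial.

```python
import os
import tempfile
import time
from datetime import datetime


def list_ticker_dirs(statspath):
    """Return the sub-directories of statspath in sorted order.

    os.walk() makes no promise about ordering, so the result is sorted
    explicitly. The root directory sorts first (it is a prefix of every
    other path), which is why the first entry is dropped.
    """
    return sorted(x[0] for x in os.walk(statspath))[1:]


def parse_snapshot_name(file_name):
    """Parse a name such as '20040130190047.html' into (datetime, unix time)."""
    date_stamp = datetime.strptime(file_name, '%Y%m%d%H%M%S.html')
    # time.mktime interprets the time tuple in the local time zone
    unix_time = time.mktime(date_stamp.timetuple())
    return date_stamp, unix_time


# Hypothetical layout: three ticker folders created in arbitrary order
with tempfile.TemporaryDirectory() as statspath:
    for ticker in ('aapl', 'a', 'aa'):
        os.mkdir(os.path.join(statspath, ticker))
    print([os.path.basename(d) for d in list_ticker_dirs(statspath)])

print(parse_snapshot_name('20040130190047.html'))
```

Because time.mktime works in local time, the Unix value printed for the same file name depends on the time zone of the machine running the script.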

5. More Parsing

import pandas as pd
import os
import time
from datetime import datetime

path = "/home/sum/share/Ubuntu_DeepLearning/intraQuarter" #cd path & pwd

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    stock_list = sorted([x[0] for x in os.walk(statspath)]) #in Linux use sorted() func

    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        #print(each_file)
        ticker = each_dir.split("/")[-1]  #in Linux use '/'

        if len(each_file) > 0:
            for file in each_file:
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                #print(date_stamp, unix_time)
                full_file_path = each_dir+'/'+file
                #print(full_file_path)
                
                source = open(full_file_path, 'r').read()
                value = source.split(gather+':') #exist </td> or </th>, may exist \n, so just use : and split twice
                if 1 < len(value):
                    value = value[1].split('<td class="yfnc_tabledata1">')[1].split('</td>')[0]
                else:
                    value = 'NoValue'
                print(ticker+":",value)

            #time.sleep(15)

Key_Stats()

Here the value is extracted with split() and fixed strings; for a more general approach, see Regular Expressions.
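
As a rough illustration of the regular-expression alternative, the sketch below does the same extraction with the re module. The function name extract_value and the sample HTML are assumptions made for this example (only the yfnc_tabledata1 cell class comes from the script above); the real Yahoo Finance pages may need a different pattern.

```python
import re


def extract_value(source, gather="Total Debt/Equity (mrq)"):
    """Return the text of the first data cell after the `gather` label.

    Falls back to 'NoValue' when the label or the data cell is missing.
    """
    # re.escape keeps '(', ')' and '/' in the label from being read as regex syntax.
    # '.*?' with re.DOTALL skips any closing tag or newline between label and cell.
    pattern = re.escape(gather) + r':.*?<td class="yfnc_tabledata1">(.*?)</td>'
    match = re.search(pattern, source, flags=re.DOTALL)
    return match.group(1) if match else 'NoValue'


# Made-up HTML fragment, only to exercise the pattern
sample = ('<td class="yfnc_tablehead1">Total Debt/Equity (mrq):</td>\n'
          '<td class="yfnc_tabledata1">45.67</td>')

print(extract_value(sample))
print(extract_value('<html></html>'))
```

Unlike the chained split() calls, this version also returns 'NoValue' when the label is present but the data cell is not, instead of raising an IndexError.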

6. Structuring data with Pandas

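Use pandas to store the data (datetime, unixtime, ticker, value) in a .csv file. Records whose value is 'N/A' or 'NoValue' are skipped: float(value) raises an exception and the except branch simply passes.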

import pandas as pd
import os
import time
from datetime import datetime

path = "/home/sum/share/Ubuntu_DeepLearning/intraQuarter" #cd path & pwd

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    stock_list = sorted([x[0] for x in os.walk(statspath)]) #in Linux use sorted() func
    df = pd.DataFrame(columns=['Date','Unix','Ticker','DE Ratio'])

    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        ticker = each_dir.split("/")[-1]

        if len(each_file) > 0:
            for file in each_file:
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                full_file_path = each_dir+'/'+file
                source = open(full_file_path, 'r').read()
                try:
                    value = source.split(gather+':') #exist </td> or </th>, may exist \n, so just use : and split twice
                    if 1 < len(value):
                        value = value[1].split('<td class="yfnc_tabledata1">')[1].split('</td>')[0]
                    else:
                        value = 'NoValue'
                    print(ticker+":",value)
                    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
                    df = pd.concat([df, pd.DataFrame([{'Date':date_stamp, 'Unix':unix_time, 'Ticker':ticker, 'DE Ratio':float(value)}])], ignore_index=True)
                except Exception as e:
                    pass

    save = gather.replace(' ','').replace('(','').replace(')','').replace('/','')+('.6.csv')
    print(save)
    df.to_csv(save)

Key_Stats()

[Figure: contents of the .csv file]

Structuring the data with Pandas makes later processing more efficient.
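
One possible way to structure the rows, sketched here as an alternative to growing the DataFrame inside the loop, is to collect plain dicts in a list and build the DataFrame once at the end. The tickers and values below are invented sample data and rows_to_frame is a hypothetical helper, not something from the tutorial.

```python
import time
from datetime import datetime

import pandas as pd

COLUMNS = ['Date', 'Unix', 'Ticker', 'DE Ratio']


def rows_to_frame(rows):
    """Build the DataFrame in one step from a list of row dicts."""
    return pd.DataFrame(rows, columns=COLUMNS)


# Invented (ticker, raw value) pairs standing in for the parsed HTML files
parsed = [('a', '45.67'), ('aa', 'N/A'), ('aapl', 'NoValue'), ('abt', '12.5')]

date_stamp = datetime(2004, 1, 30, 19, 0, 47)
unix_time = time.mktime(date_stamp.timetuple())

rows = []
for ticker, raw_value in parsed:
    try:
        de_ratio = float(raw_value)  # 'N/A' and 'NoValue' raise ValueError
    except ValueError:
        continue  # skip records without a usable number
    rows.append({'Date': date_stamp, 'Unix': unix_time,
                 'Ticker': ticker, 'DE Ratio': de_ratio})

df = rows_to_frame(rows)
print(df)
```

Appending to a Python list is cheap, while every append or concat on a DataFrame copies the whole frame, so this pattern tends to scale better as the number of files grows.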

7. Getting more data and meshing data sets

The goal of processing labeled data is classification. For investing, we only distinguish, for a given stock:

With a finer-grained classification, the stocks could perhaps be divided into:

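Yahoo Finance already provides some of the relevant data, but to practice merging two data sources we get the S&P 500 data from Quandl: search for the index and download the data from 2000 onward in CSV format. Because the Quandl website no longer works the way it does in the tutorial, enter the address used in the video directly as the URL to download the S&P 500 Index data set; it can also be downloaded from my Baidu cloud drive. The data covers January 3, 2000 to March 22, 2016.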

import pandas as pd
import os
import time
from datetime import datetime

path = "/home/sum/share/Ubuntu_DeepLearning/intraQuarter" #cd path & pwd

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    # read the data sets
    statspath = path+'/_KeyStats'
    stock_list = sorted([x[0] for x in os.walk(statspath)]) #in Linux use sorted() func
    df = pd.DataFrame(columns=['Date','Unix','Ticker','DE Ratio','Price','SP500'])

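    # pd.DataFrame.from_csv was removed in pandas 1.0; read_csv with these arguments matches its defaults
    sp500_df = pd.read_csv("YAHOO-INDEX_GSPC.csv", index_col=0, parse_dates=True)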

    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        ticker = each_dir.split("/")[-1]

        if len(each_file) > 0:
            for file in each_file:
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                full_file_path = each_dir+'/'+file
                source = open(full_file_path, 'r').read()
                try:
                    value = source.split(gather+':') #exist </td> or </th>, may exist \n, so just use : and split twice
                    if 1 < len(value):
                        value = value[1].split('<td class="yfnc_tabledata1">')[1].split('</td>')[0]
                    else:
                        value = 'NoValue'

                    try:
                        sp500_date = datetime.fromtimestamp(unix_time).strftime('%Y-%m-%d')
                        row = sp500_df[(sp500_df.index == sp500_date)]
                        sp500_value = float(row["Adjusted Close"])
                    except Exception:
                        # Some of the stock data may have been pulled on a weekend day.
                        # The S&P 500 data set has no row for such a date, so the lookup
                        # is retried 3 days (259200 seconds) earlier.
                        sp500_date = datetime.fromtimestamp(unix_time-259200).strftime('%Y-%m-%d')
                        row = sp500_df[(sp500_df.index == sp500_date)]
                        sp500_value = float(row["Adjusted Close"])

                    stock_price = float(source.split('</small><big><b>')[1].split('</b></big>')[0])
                    print("ticker:",ticker,"sp500_date:",sp500_date,"stock_price:",stock_price,"sp500_value:",sp500_value)

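                    # Some files contain no stock_price; the exception raised above skips them.
                    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
                    df = pd.concat([df, pd.DataFrame([{'Date':date_stamp,
                                                       'Unix':unix_time,
                                                       'Ticker':ticker,
                                                       'DE Ratio':float(value),
                                                       'Price':stock_price,
                                                       'SP500':sp500_value}])], ignore_index=True)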
                except Exception as e:
                    pass

    save = gather.replace(' ','').replace('(','').replace(')','').replace('/','')+('.7.csv')
    print(save)
    df.to_csv(save)

Key_Stats()

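The nested try/except block is needed because the S&P 500 has no value on weekends, so the lookup is retried with 3 days (expressed in seconds) subtracted from the timestamp.
Compared with TotalDebtEquitymrq.6.csv, the TotalDebtEquitymrq.7.csv generated this time is missing some rows. Debugging shows that most of them are missing because the HTML file from Yahoo Finance contains no stock_price for that day.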
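
A fixed 3-day offset maps a Saturday to Wednesday and a Sunday to Thursday, so the value it finds is not the most recent close. The sketch below shows one alternative, which is not what the tutorial does: pandas' Series.asof returns the last available value on or before a given date. The function name close_on_or_before and the index values in the sample frame are invented for illustration; only the "Adjusted Close" column name comes from the script above.

```python
import pandas as pd


def close_on_or_before(sp500_df, date, column="Adjusted Close"):
    """Return the most recent value of `column` on or before `date`.

    Series.asof needs a sorted index, so the series is sorted first.
    The result is NaN when `date` is earlier than every row.
    """
    series = sp500_df[column].sort_index()
    return float(series.asof(pd.Timestamp(date)))


# Invented sample: Thursday, Friday and Monday rows, stored newest first
sample_df = pd.DataFrame(
    {"Adjusted Close": [102.0, 101.0, 100.0]},
    index=pd.to_datetime(["2016-03-21", "2016-03-18", "2016-03-17"]),
)

# 2016-03-20 is a Sunday, so the Friday row is used
print(close_on_or_before(sample_df, "2016-03-20"))
# A trading day returns its own row
print(close_on_or_before(sample_df, "2016-03-21"))
```

With this lookup no second attempt is needed, and a holiday that closes the market on a weekday is handled the same way as a weekend.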

[Figures: debug output; TotalDebtEquitymrq.7.csv]