Train on Kaggle; deploy on Google Cloud
By Bamigbade Opeyemi (link to the original article)
The deployment of a machine learning (ML) model to production starts with actually building the model, which can be done in several ways and with many tools.
The approach and tools used at the development stage are very important in ensuring the smooth integration of the basic units that make up the machine learning pipeline. If these are not taken into consideration before starting a project, there's a good chance you'll end up with an ML system that has low efficiency and high latency.
For instance, using a function that has been deprecated might still work, but it tends to raise warnings and, as such, increases the response time of the system.
The first thing to do to ensure good integration of all system units is to have a system architecture (blueprint) that shows the end-to-end integration of each logical part of the system. Below is the system architecture designed for this mini-project.
Model Development
When we discuss model development, we’re talking about an iterative process where hypotheses are tested and models are derived, trained, tested, and built until a model with desired results is achieved.
This is the fun part for data scientist teams, where they can use their machine learning skills for tasks such as exploratory data analysis, feature engineering, model training, and evaluations on the given data.
The model used in this project was built and serialized on this Kaggle kernel using the Titanic dataset. Note that I only used existing modules from standard packages such as Pandas, NumPy, and sklearn, so as not to end up building custom modules.
The performance of the model could be greatly improved with feature transformation, but most of the transformers that work best on this data are not available in sklearn without combining Pandas, NumPy, and other useful libraries, and that would lead to building additional modules during deployment. To keep things as simple as possible, I'll refrain from exploring these topics in too much depth.
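As a concrete illustration, a kernel along these lines would produce the artifacts that the deployment script below expects. The artifact file names match the ones loaded in main.py, but the imputation strategies, feature order, and choice of estimator are my assumptions, not necessarily what the original kernel used:

# A minimal sketch (not the author's exact kernel) of training and
# serializing the model and preprocessing artifacts on Kaggle.
import numpy as np
import pandas as pd
import joblib
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("/kaggle/input/titanic/train.csv")

# Fit the imputers that the prediction service will reuse
age_imputer = SimpleImputer(strategy="median").fit(train[["Age"]])
embark_imputer = SimpleImputer(strategy="most_frequent").fit(train[["Embarked"]])
train["Age"] = age_imputer.transform(train[["Age"]]).ravel()
train["Embarked"] = embark_imputer.transform(train[["Embarked"]]).ravel()

# Feature engineering mirroring the deployment script
train["Sex"] = train["Sex"].map({"male": 1, "female": 0})
train["Number_of_relatives"] = train["Parch"] + train["SibSp"]

one_hot_enc = OneHotEncoder(drop="first", sparse=False).fit(train[["Embarked", "Pclass"]])
encoded = one_hot_enc.transform(train[["Embarked", "Pclass"]])

# Keep the same column order the deployment script produces
features = np.column_stack([train[["Sex", "Age", "Fare", "Number_of_relatives"]].to_numpy(), encoded])
scaler = StandardScaler().fit(features)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(scaler.transform(features), train["Survived"])

# Serialize the model and every preprocessing artifact for deployment
joblib.dump(model, "model-v1.joblib")
joblib.dump(age_imputer, "age_imputer.joblib")
joblib.dump(embark_imputer, "embark_imputer.joblib")
joblib.dump(one_hot_enc, "One_hot_enc.joblib")
joblib.dump(scaler, "scaler.joblib")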
Kaggle is the largest online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish datasets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
ML Deployment on Google Cloud Platform
Google Cloud Platform (GCP) is one of the primary options for cloud-based deployment of ML models, along with others such as AWS, Microsoft Azure, etc.
With GCP, depending on how you choose to have your model deployed, there are basically three options:
- Google AI Platform
- Google Cloud Function
- Google App Engine
Why use the App Engine for this project?
The App Engine is a comprehensive cloud-based platform that combines infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). Its runtimes and languages are kept up to date, with great documentation. Features in the preview stage (beta) are made available to a large number of users, which keeps us informed of possible future developments.
To deploy this model on the App Engine using a terminal, there are four major things needed:
- The serialized model and model artifacts: the saved trained model and the other standard objects used during data transformation. Upon deployment, all of these are stored in Google Storage (a bucket) so that the main script can access them for test-data preparation and for making predictions.
- The main script (main.py): the script in which the prediction function is written, and where all the libraries listed in the requirements file and needed for end-to-end data preparation and prediction are imported. Each line of code is commented so that it's easier to read; the full script follows this list.
- requirements.txt: a simple text file that contains the model's dependencies, pinned to the exact versions used during training. To avoid running into trouble, it's better to check the available versions of all the libraries and packages you'll be using on the cloud before developing the model; the file used here appears after the script.
- The app.yaml file: the file used to configure your App Engine app's settings. It specifies how URL paths correspond to request handlers and static files, and it also contains information about your app's code, such as the runtime and the latest version identifier. Any configuration you omit from this file is set to its default. For this simple app, I only need to set the runtime to python37 so that App Engine knows the Docker image that will run the app; the one-line file closes this section.
main.py:
# Import libraries
import json
import numpy as np
import pandas as pd
import joblib  # installed together with scikit-learn; sklearn.externals.joblib is deprecated
from flask import Flask, request

app = Flask(__name__)

# Load the model and all preprocessing artifacts
model = joblib.load("model-v1.joblib")
age_imputer = joblib.load("age_imputer.joblib")
embark_imputer = joblib.load("embark_imputer.joblib")
One_hot_enc = joblib.load("One_hot_enc.joblib")
scaler = joblib.load("scaler.joblib")

# Data preprocessing attributes
drop_cols = ['PassengerId', 'Ticket', 'Cabin', 'Name']
gender_dic = {'male': 1, 'female': 0}
cat_col = ['Embarked', 'Pclass']

def data_preprocessor(*, jsonify_data) -> 'clean_data':
    """Turn the raw JSON payload into model-ready features."""
    test = pd.read_json(jsonify_data)  # read the JSON payload
    passenger_id = list(test.PassengerId)  # keep the PassengerId values for the response
    if test.Fare.isnull().any():
        test.Fare.fillna(value=test.Fare.mean(), inplace=True)  # fill missing values in the Fare column with the mean
    test.Age = age_imputer.transform(np.array(test.Age).reshape(-1, 1))  # fill missing values in the Age column
    test.Embarked = embark_imputer.transform(np.array(test.Embarked).reshape(-1, 1))  # fill missing values in the Embarked column
    test.drop(columns=drop_cols, axis=1, inplace=True)  # drop the columns not used during training
    test['Number_of_relatives'] = test.Parch + test.SibSp  # add the Number_of_relatives feature
    test.drop(columns=['Parch', 'SibSp'], axis=1, inplace=True)  # drop the Parch and SibSp features
    test.Sex = test.Sex.map(gender_dic)  # encode the Sex feature as a number
    encoded_test = pd.DataFrame(data=One_hot_enc.transform(test[cat_col]),
                                columns=['emb_2', 'emb_3', 'Pclass_2', 'Pclass_3'])  # one-hot encode the categorical features
    test.drop(columns=cat_col, axis=1, inplace=True)  # drop the raw categorical columns
    test = pd.concat([test, encoded_test], axis=1)  # concatenate the test data with the encoded columns
    test = scaler.transform(test)  # scale all values
    return test, passenger_id

@app.route('/prediction_endpoint', methods=['GET', 'POST'])
# Decorate prediction_endpoint with the configured route and methods
def prediction_endpoint():
    if request.method == 'GET':
        return 'kindly send a POST request'  # returned when the request is a GET
    elif request.method == 'POST':
        input_data = pd.read_csv(request.files.get("input_file"))  # receive the csv file through the API
        testfile_json = input_data.to_json(orient='records')  # convert it to JSON
        clean_data, ids = data_preprocessor(jsonify_data=testfile_json)  # run the preprocessing function
        passenger_id = list()  # build human-readable passenger ids
        for passenger in ids:
            passenger_id.append('Passenger_' + str(passenger))
        # Make predictions
        result = list()
        model_predictions = list(np.where(model.predict(clean_data) == 1, 'Survived', 'died'))  # map the binary predictions to human-readable labels
        for index, status in enumerate(model_predictions):
            result.append([passenger_id[index], status])
        response = json.dumps(result)
        return response  # return the model's response to the user

if __name__ == "__main__":
    app.run()  # run the Flask app instance
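Before any of this goes to the cloud, the script can be sanity-checked on your own machine; with the serialized artifacts sitting in the same folder and the dependencies below installed, a local run is just:

python main.py  # starts the Flask development server locally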
requirements.txt:
scikit-learn==0.22
numpy==1.18.0
pandas==0.25.3
flask==1.1.1
app.yaml:
runtime: python37
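One line is enough here because, on the Python 3.7 standard runtime, App Engine looks for a WSGI object named app in main.py and serves it with gunicorn when no entrypoint is specified. If you'd rather pin that behavior down explicitly, a slightly fuller app.yaml (an optional variant, not what this project deployed) would be:

runtime: python37
entrypoint: gunicorn -b :$PORT main:app  # add gunicorn to requirements.txt if you set this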
Steps to deploying the model on Google’s App Engine
- Create a project on Google Cloud Platform.
- Select the project and create an app using App Engine. Set up the application by choosing the permanent region where you want Google to manage your app. After this step, select the programming language used in writing the app.
- Either download the Cloud SDK to deploy from your local machine, or activate Cloud Shell from the cloud console. For this demo, I'm using Cloud Shell. Once the shell is activated, ensure that the Cloud Platform project is set to the intended project ID.
- Clone your GitHub project repo on the engine by running git clone <link to clone your repository>.
- Change to the directory of the project containing the files to be uploaded to App Engine by running cd <cloned project folder>. You can list directories by running ls.
- Initialize gcloud in the project directory by running gcloud init. This triggers some questions about the configuration of the Google Cloud SDK, which are pretty straightforward and can be easily answered.
- Deploy the app by running gcloud app deploy. It takes some time to upload files, install app dependencies, and deploy the app.
- Once the uploading is done, you can run gcloud app browse to start the app in the browser, or copy the app URL manually if the browser isn't detected. (The full command sequence is summarized below.)
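Put together, the Cloud Shell session looks roughly like this; the repository URL and folder name are placeholders for your own project:

# Placeholders: substitute your own repository URL and folder name
git clone https://github.com/<your-username>/<your-repo>.git
cd <your-repo>
gcloud init        # configure the SDK and select the intended project ID
gcloud app deploy  # upload files, install dependencies, deploy the app
gcloud app browse  # open the deployed app in the browser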
Test app with Postman
Since we didn't build a web interface for the project, we could test the app by sending it an HTTP request from any client, but we'll use Postman because we're predicting in batches, given how the dataset is read on the backend. Below is the response from the app after sending an HTTP request to get predictions for the uploaded test data.
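If you'd rather script the check than click through Postman, the same batch request can be sent from Python. The URL is a placeholder for your deployed app's address, and test.csv stands for a file with the usual Titanic test columns:

# Hypothetical client-side test of the /prediction_endpoint route.
# Replace <your-app-url> with the URL printed by gcloud app browse.
import requests

with open("test.csv", "rb") as f:
    response = requests.post(
        "https://<your-app-url>/prediction_endpoint",
        files={"input_file": f},  # matches request.files.get("input_file") in main.py
    )
print(response.json())  # a list of [passenger_id, status] pairs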
Wrapping up
The deployment of ML models can take different forms depending on the dataset, target platform for deployment, how end-users will utilize it, and many other factors.
In order to have a smooth ride when deploying models, do a thorough review of all units that make up the ML system architecture you’re trying to implement.