Machine learning in GCP(first we
The following is a course on Coursera:
Machine learning with tensorflow on Google Cloud Platform
https://www.coursera.org/learn/google-machine-learning/home/welcome
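Some excerpts from the first week's content, with my own thoughts
1,To be successful at ML, you need to think not just about creating models, but also about serving ML predictions
If you want to put machine learning to good use in your business, think hard about how the prediction step will actually be delivered, not only about building the model.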
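2,we should make sure that we can process batch data and stream data the same way.
This is the same issue as #1. Many companies fail when they try to improve their business with ML: for example, they build a model but do not know how to keep feeding production data into it for training. Batch data means the usual log files, images and so on; stream data means the metrics and events you collect. Have you thought about how to convert these different kinds of real data into the format your model defines, and then deliver it to the model?
From the data side: is the conversion reliable?
From the engineering side: which messaging system will you choose?
From the code side: how will you implement it?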
3,need to be good at data engineering
Machine Learning
Data Pipeline
Data Analytics
Data Collection
Scalability Reliability Engineering
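To practice machine learning well, you need some depth in data engineering. The five terms above form a pyramid, listed from top to bottom. Note that the bottom layer is also an "SRE", but here the S stands for scalability rather than site. My reading of this bottom layer: can you guarantee that your scalable service, as it grows, still fits this machine learning effort (for example, by supplying training data and later consuming the predictions the model serves)?
Next is data collection. The hard part here is mainly building the collection system: it has to withstand the load, scale with the service, and still be easy to implement.
Once data is collected, it has to be analyzed and processed: filter out dirty data, and refine data that is too raw.
How to then feed this data into the model is another problem (the pipeline). Plenty of TensorFlow examples can be found online. Take the most popular one, classifying photos of clothes and shoes: that model expects 28x28-pixel images by default. The question is, how do you turn the varied data from your production environment into data that regular, and then pipeline it to the model for training?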
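As a rough illustration of the shape problem above, here is a minimal sketch of squeezing an arbitrary-size grayscale image into the 28x28 input the example model expects. The names resize_nearest and to_model_input are my own, and nearest-neighbour sampling plus scaling to [0, 1] is only one assumed choice; a real pipeline would more likely use an image library.

```python
from typing import List

Image = List[List[int]]  # grayscale pixel rows, values 0..255

TARGET = 28  # side length expected by the example model


def resize_nearest(img: Image, size: int = TARGET) -> Image:
    """Resize a grayscale image to size x size with nearest-neighbour sampling."""
    src_h, src_w = len(img), len(img[0])
    return [
        [img[r * src_h // size][c * src_w // size] for c in range(size)]
        for r in range(size)
    ]


def to_model_input(img: Image) -> List[List[float]]:
    """Resize, then scale pixel values into the range [0, 1]."""
    return [[p / 255.0 for p in row] for row in resize_nearest(img)]


# A synthetic 640x480 "production" image (made-up data for the sketch).
raw = [[(x + y) % 256 for x in range(640)] for y in range(480)]
model_input = to_model_input(raw)
print(len(model_input), len(model_input[0]))
```

Whatever the production source looks like, everything that reaches the model passes through the same function and comes out with the same shape and value range.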
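4,what's the difference between ML and AI?
AI (artificial intelligence) is a discipline: making machines act like humans
ML (machine learning) is a toolset, like neural networks
AI contains ML
What is the relationship between machine learning and artificial intelligence? Put simply: AI is the discipline; AI lets machines act the way humans do (for example, many human judgments involve emotion as well as reason)
Machine learning is a toolset, for example neural networks.
Artificial intelligence contains machine learning, the same relationship as between the words "fruit" and "apple"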
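5,the old neural networks had just one hidden layer because of limits in:
computing power
data
computational tricks
Neural networks were actually proposed more than thirty years ago and were implemented back then too, but their applications were very limited, and they mostly had just one hidden layer (unlike today's models, which have many layers; the convolution or compression layers alone may number several)
The main causes were: computers were too weak, there was too little data, and the tricks were missing (tricks here can be read as supporting techniques; in today's ML, many convolutional models, filtering methods and compression methods are very important)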
6,every product at Google has a dozen or so ML models
this is an example:
Predict product demand
Predict inventory
Predict restocking time
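At the end of 2016 Google had roughly 4,000 ML models internally; by now the number is probably over ten thousand. Note, though, that a single product will have several machine learning models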
Google photos
Google Translate: when you take a photo of a sign that you do not recognize
Model1: Identify the Sign
Model2: OCR the Characters
Model3: Identify Language
Model4: Translate Language
Model5: Superimpose Text
Model6: Select Correct Font
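For example, when you use Google Translate to photograph a sign and have it recognized automatically, at least six models are involved: 1, identify the sign (parking, speed limit and so on) 2, extract the characters 3, identify the language of the characters 4, translate 5, overlay the translated text onto the original sign 6, select a suitable font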
Smart Reply Inbox: a complicated ML application
sequence to sequence model
the output of previous model will be the input of next model
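To make "the output of the previous model is the input of the next" concrete, here is a minimal sketch that chains the six Google Translate stages listed above. Every stage is a hypothetical stub returning a hard-coded toy value (there is no real model or Google API behind any of them); the only point is the chaining in run_pipeline.

```python
from functools import reduce
from typing import Callable, Dict, List

State = Dict[str, str]
Stage = Callable[[State], State]

# Toy lookup table standing in for a real translation model (assumption).
TOY_DICTIONARY = {("es", "ALTO"): "STOP"}


def identify_sign(state: State) -> State:
    return {**state, "sign_region": "octagon"}      # Model1 stub


def ocr_characters(state: State) -> State:
    return {**state, "text": "ALTO"}                # Model2 stub


def identify_language(state: State) -> State:
    return {**state, "language": "es"}              # Model3 stub


def translate_text(state: State) -> State:
    key = (state["language"], state["text"])        # Model4 stub
    return {**state, "translated": TOY_DICTIONARY[key]}


def superimpose_text(state: State) -> State:
    overlay = f'{state["translated"]} on {state["sign_region"]}'
    return {**state, "overlay": overlay}            # Model5 stub


def select_font(state: State) -> State:
    return {**state, "font": "bold-sans"}           # Model6 stub


STAGES: List[Stage] = [
    identify_sign, ocr_characters, identify_language,
    translate_text, superimpose_text, select_font,
]


def run_pipeline(stages: List[Stage], state: State) -> State:
    """Feed the output of each stage into the next one."""
    return reduce(lambda acc, stage: stage(acc), stages, state)


result = run_pipeline(STAGES, {"photo": "sign.jpg"})
print(result["overlay"], "/", result["font"])
```

Because each stage only reads what earlier stages wrote, any one stub could in principle be swapped for a real or pre-trained model without touching the others.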
7, what kinds of problems can ML solve?
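Eric Schmidt said ML is about replacing programming, but most of us think of it as predicting data.
What problem is machine learning really meant to solve?
Note!!
We can train on existing data to obtain a model, and then we naturally think of using the trained model to make predictions. But this understanding needs to go one step further: the real purpose of machine learning is to replace our existing way of building business logic, in which we find a requirement, write code for it, and then iterate
Machine learning scales better than hand-coded rules
Machine learning scales better than hand-written business rules
like when you search for "park" in a search engine:
An example was given here: if you searched Google for a park some years ago, there was a hand-written set of logic behind it that looked at the user's geographic location, or the location in their profile, and returned results according to those hand-coded rules
hand-coded rules are really hard to maintain, ML scales better because it's automated
But hand-written rules like these are really hard to maintain
Google RankBrain (a deep neural network for search ranking that improved performance significantly)
RankBrain is mentioned here: it is the machine learning system of Google's search engine. It learns from users' searches and then judges or guesses what the user is really looking for
So we reach the conclusion: what kinds of problems can ML solve?
the answer is: anything for which you are writing rules today
So the conclusion: what problems does machine learning actually solve?
Wherever your business needs people to write rules by hand, machine learning can replace them
(This reminds me of building a CMDB for network devices: adding devices of different vendors and types could require writing different inspection rules and different metric-collection rules)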
8,It's all about data
when you search "coffee near me"
an example is a piece of labelled data; the label for the example above is "Does the user like the result or not?"
9,Framing an ML problem
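Framing a machine learning problem
To put machine learning into practice, we can frame the problem on three levels
cast it as a learning problem (what data is for training, what is for predicting?)
Machine learning level: which data to train on, what to predict, and how to model it as train_data and label (this should be read alongside the TensorFlow sample code: the digits 0~9 can naturally use 0~9 as their labels, but for a real problem, how do you define the labels?)
cast it as a software problem (what is the API for the service, who will use the service, how is it done today?)
Software level: what API will finally be provided? Who will use the service (and what do they care about)? How is the business problem handled today, before any machine learning (what are the pain points)?
cast it in the framework of a data problem (key actions to collect, analyze,)
Some scenarios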
10,Infuse your apps with ML
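Some successful real-world experience
AUCNET as an example
AUCNET is a Japanese website: you take photos of a car, it identifies the car model for you and integrates other services (such as the full specifications of that model, purchase information and so on)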
11,What is the pre-trained model?
GCP provides:
Vision API
Speech API
Jobs API
Translation API
Natural Language API
Video Intelligence API
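This is effectively an advertisement for GCP, Google Cloud Platform. The point is that if, say, you want to build a recognize-and-translate service, you do not need to implement every model yourself; Google Cloud Platform already offers a machine learning API for vision, for instance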
12,The ML marketplace is moving towards increasing levels of ML abstraction
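The ML market is moving toward higher levels of abstraction in machine learning (how to understand this? My personal take is that it resembles the mathematics you meet from primary school through university and on to a master's or PhD)
The core of mathematics is explaining reality through models, and clearly an equation like y = kx+b can capture far fewer real problems than Fourier analysis can abstract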
13,Build a data strategy around ML
14,Simple ML and More Data > Fancy ML and Small Data
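so spend your energy collecting more data, not only in quantity but also in variety
What matters most in machine learning is not how elegant your model is, or how sophisticated and accurate it is; a model is something you iterate on. What matters more is a large amount of data, and besides sufficient quantity you also need as much variety as possible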
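15,how to apply ML successfully?
Collecting data is often the longest and hardest part of an ML project and the most likely to fail
In applying ML, the step that takes the most time and most easily leads to failure is collecting data
collecting data includes rating; rating means finding labels for the data
Collecting here also includes rating, which means assigning labels to the data (my understanding: in a real business environment, most production data is hard to describe with a simple yes or no. We might describe it with words like "okay" or "very good", but such fuzzy descriptions do not work for a machine; training data in particular needs labels that clearly tell examples apart)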
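One way to picture the rating step is a small sketch that maps fuzzy human wording onto a fixed set of label ids. The vocabulary, the three classes, and the names RATING_TO_LABEL, to_label and split_rated are all my own assumptions for illustration; real labelling rules would come from the business.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical vocabulary: several fuzzy phrases collapse into three classes.
RATING_TO_LABEL: Dict[str, int] = {
    "bad": 0, "poor": 0,
    "okay": 1, "fine": 1,
    "very good": 2, "great": 2,
}


def to_label(raw_rating: str) -> Optional[int]:
    """Return a class id, or None when the wording is not in the vocabulary."""
    return RATING_TO_LABEL.get(raw_rating.strip().lower())


def split_rated(rows: List[Tuple[str, str]]):
    """Separate rows with a clear label from rows that need human review."""
    labelled, needs_review = [], []
    for example, rating in rows:
        label = to_label(rating)
        if label is None:
            needs_review.append((example, rating))
        else:
            labelled.append((example, label))
    return labelled, needs_review


rows = [("result A", " Very Good "), ("result B", "meh"), ("result C", "poor")]
labelled, needs_review = split_rated(rows)
print(labelled, needs_review)
```

Anything that cannot be mapped cleanly is set aside rather than guessed, so the training set only contains labels that clearly distinguish the examples.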
ML is a journey towards automation and scale
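Keep this in mind: the purpose of practicing machine learning is automation and scaling the business
when we talk about ML, most engineers keep thinking about training, but the true utility of ML comes during predictions
When we talk about machine learning, most engineers keep thinking about how to train, but the real use of machine learning lies in prediction. Keep hold of this point and do not get too hung up on model training
your models have to work on streaming data
The model you build must be able to work on streaming data (meaning: when we learn TensorFlow, the datasets we use may be prepared in advance, but once in real production the model needs to be continually corrected from streaming data and also to make predictions for streaming data)
sometimes ML fails because of something called training-serving skew
to reduce this skew, you'd better take the same code that was used to process historical data during training and reuse it during predictions
We need to make sure that training and prediction use the same environment and the same code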
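A minimal sketch of that advice, assuming a made-up record format with "price" and "country" fields: a single preprocess function is the only place features are computed, and both the training path and the serving path call it. The feature choices, the weights and all function names are illustrative assumptions, not anything from the course.

```python
import math
from typing import Dict, List

Record = Dict[str, str]


def preprocess(record: Record) -> List[float]:
    """Single source of truth for turning a raw record into features."""
    log_price = math.log1p(float(record["price"]))
    is_jp = 1.0 if record["country"].strip().upper() == "JP" else 0.0
    return [log_price, is_jp]


def build_training_set(history: List[Record]) -> List[List[float]]:
    """Training path: historical records go through preprocess()."""
    return [preprocess(r) for r in history]


def serve(record: Record, weights: List[float]) -> float:
    """Serving path: the live record goes through the very same preprocess()."""
    features = preprocess(record)
    return sum(w * x for w, x in zip(weights, features))


history = [{"price": "0", "country": " jp "}, {"price": "99", "country": "US"}]
train_x = build_training_set(history)
score = serve({"price": "0", "country": "JP"}, weights=[0.5, 2.0])
print(train_x[0], score)
```

If the feature logic were copied into two places instead, a later change to one copy (say, a different log transform) would silently make serving inputs differ from what the model was trained on.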
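your data pipeline has to process both batch and stream
Your data pipeline needs to handle both batch and stream data; this says the same thing as "work on streaming data" above. Batch data is easy to understand and easy to implement, but stream data is not so easy to handle (this is also easy to see, especially for something as data-hungry as machine learning: if you have ever built and run a distributed messaging system, you know the trouble stream data brings)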
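As a small sketch of "process batch and stream the same way", the cleaning step below is written against any iterable, so a list loaded from a log file and a generator fed by events go through identical code. The latency_ms field and the event_stream generator are invented for the example; a real stream would come from a messaging system rather than a local generator.

```python
from typing import Dict, Iterable, Iterator


def clean(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, float]]:
    """Drop malformed or negative readings and convert milliseconds to seconds."""
    for record in records:
        try:
            latency_ms = float(record["latency_ms"])
        except (KeyError, ValueError):
            continue  # dirty data: missing field or not a number
        if latency_ms >= 0:
            yield {"latency_s": latency_ms / 1000.0}


# Batch: a finite list, e.g. parsed from yesterday's log file.
batch = [{"latency_ms": "120"}, {"latency_ms": "oops"}, {"latency_ms": "80"}]


def event_stream() -> Iterator[Dict[str, str]]:
    """Stand-in for an unbounded event source (assumption)."""
    for value in ("200", "-1", "50"):
        yield {"latency_ms": value}


batch_out = list(clean(batch))
stream_out = list(clean(event_stream()))
print(batch_out, stream_out)
```

Because clean is a generator, it never needs the whole input in memory, which is what lets the same function serve both cases.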
During prediction, the key performance aspect is speed of response
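In the prediction phase, the most important performance metric is response speed
the magic of ML comes with quantity, not complexity
The magic of ML comes from large amounts of data, not from complexity (writing complicated, flashy code does not make it better)
Unstructured data accounts for 90% of enterprise data (like email, video footage, texts, reports, catalogs, events)
Although the training data we use when learning ML and TensorFlow is tidy, in real business 90% of the data is unstructured, such as email, video, text, reports and so on
pre-trained models make processing unstructured data easier
So learn to use the ready-made models provided by other companies and organizations for data processing (partly an advertisement for GCP's ML APIs, and partly a warning to engineers who want to practice ML not to insist on implementing every step of ML themselves)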
how can business benefit from ML?
1,Infuse your apps with ML: simplify user input, adapt to the user
2,fine-tune your business: streamline your business processes
3,Anticipate users' needs: creatively fulfill intent
How Google Does ML
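Google suggests that we should focus more on collecting data and building infrastructure instead of optimizing the ML algorithm
Google presents itself as the company most successful at applying machine learning, without even adding "one of". If you want to practice machine learning to help your own business, put plenty of effort into collecting data and building infrastructure (infrastructure here means things like the data pipeline, or infrastructure for deploying services at scale; remember that one of the goals of ML mentioned earlier is scale)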
Avoid these top 10 ML pitfalls
The 10 ML pitfalls
1,ML requires just as much software infrastructure
successful ML practise needs lots of things around the algorithm like a whole software stack to serve
2,no data collected yet
there is no need to talk about ML without collecting great data or access to great data
3,assume the data is ready for use
4,keep human in loop
5,product launch for the wrong thing
6,ML optimizing for the wrong thing
7,is your ML improving things in the real world
8,using a pre-trained ML algorithm vs building your own
9,ML algorithms are trained more than once
10,trying to design your own perception or NLP algorithm
微信截图_20191101184501.png
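There is a slide-design trick worth learning here: in the figure above, besides listing the 10 pitfalls as 1~10, the small coloured square in front of each one indicates the phase in which it may occur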
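the good news: most of the value comes along the way.
as you march towards ML, you may not get there, and you will still greatly improve everything you're working on. when you get there, ML improves almost everything it touches once you're ready.
After so many reasons why machine learning fails, everyone also deserves some encouragement: the process of practicing ML will itself bring benefits to your business
if the process to build and use ML is hard for your company, it's likely hard for the other members of your industry.
Understand that if you find the process hard, it is just as hard for your competitors.
But if you can produce even a small result, the feedback is excellent: customers readily notice the better service and give you more positive, accurate data in return, and that feedback in turn pushes you to fine-tune your business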
ML and business processes
Look at 5 phases:
1, Individual contributor
2, Delegation
3, Digitization
4, Big data and Analytics
5, Analytics Machine Learning
Phases 1~3 are the traditional business model; 4~5 are the big data and machine learning that have been so popular in recent years
finally, great ML systems will need humans in the loop.
and you should think about ML as a way to expand the impact or to scale the impact of your people, not as a way of completely removing them.
the more people you have in your organization, the more voices you have saying that automation is impossible
This sentence describes a problem with staying in the delegation phase above for too long: the more people in your organization, the harder automation is to achieve
Learn how to identify the origins of bias in ML/ make models inclusive/ evaluate ML models with biases
ML and human bias
Picture a shoe: different people will imagine different shoes. That is human bias
but just because something is based on data doesn't automatically make it neutral
Because models are trained by humans, and different people lean different ways even about the same thing, human bias is a problem that needs attention
a common way that we evaluate performance in ML is by using a confusion matrix.
One way we evaluate the performance of an ML model is with a confusion matrix
statistical measurement and acceptable tradeoff
we should focus on the False Positive Rate (the label says something doesn't exist but the model predicts it)
We should pay more attention to the False Positive Rate in the figure above
False Negative Rate = False Negatives / (False Negatives + True Positives)
False positive rate (α) = type I error = 1 − specificity = FP / (FP + TN) = 180 / (180 + 1820) = 9%
False negative rate (β) = type II error = 1 − sensitivity = FN / (TP + FN) = 10 / (20 + 10) = 33%
True positive rate (TPR), Recall, Sensitivity, probability of detection = Σ True positive / Σ Condition positive
Accuracy (ACC) = (Σ True positive + Σ True negative) / Σ Total population
Precision = Σ True positive / Σ Predicted condition positive
https://en.wikipedia.org/wiki/Sensitivity_and_specificity
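To check the formulas above against the worked numbers (TP = 20, FN = 10, FP = 180, TN = 1820), here is a small sketch. The ConfusionMatrix class and its method names are my own; only the counts and the definitions come from the notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # label positive, predicted positive
    fp: int  # label negative, predicted positive
    fn: int  # label positive, predicted negative
    tn: int  # label negative, predicted negative

    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn)

    def false_negative_rate(self) -> float:
        return self.fn / (self.fn + self.tp)

    def recall(self) -> float:
        """True positive rate, also called sensitivity."""
        return self.tp / (self.tp + self.fn)

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total


cm = ConfusionMatrix(tp=20, fp=180, fn=10, tn=1820)
print(f"FPR={cm.false_positive_rate():.2%}  FNR={cm.false_negative_rate():.2%}")
print(f"recall={cm.recall():.2%}  precision={cm.precision():.2%}  "
      f"accuracy={cm.accuracy():.2%}")
```

With these counts accuracy is about 91% while precision is only 10%, which is one reason a single accuracy number can hide the tradeoff the notes mention.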
The material after this uses Google's Datalab to learn and test online
This product is an online editor similar to Google Docs