Your Guide to NLP with MLSQL Stack
End2End NLP with MLSQL Stack
MLSQL stack supports a complete train/predict pipeline. This means the following steps can live in the same script:
- collect data
- preprocess data
- train
- predict
Also, since any model and any preprocessing ET can be registered as a function, you can reuse all of these functions in the Predict Service without any extra coding.
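For a quick sense of the shape, the whole flow boils down to a handful of statements like the sketch below (a schematic only; the paths and table names here are placeholders, and the real versions appear step by step in this guide):
-- schematic MLSQL pipeline (placeholder paths and names)
load xml.`/tmp/some/data` where rowTag="doc" as data;
train data as TfIdfInPlace.`/tmp/models/tfidf` where inputCol="content" as trainData;
train trainData as RandomForest.`/tmp/models/rf` where fitParam.0.featuresCol="content" and fitParam.0.labelCol="label";
register RandomForest.`/tmp/models/rf` as rf_predict;
select rf_predict(content) as predicted from trainData as output;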
Requirements
This guide requires MLSQL Stack 1.3.0-SNAPSHOT. You can set up the MLSQL stack with the following links. We recommend deploying the MLSQL stack locally.
If you meet any problem when deploying, please let me know; feel free to file an issue at this link.
Data Preparation
In this article we will work with Chinese text.
Download Sogou news from this site: news_sohusite.
Upload the file to MLSQL Stack File Server
Upload news_sohusite_xml.full.tar to the MLSQL Stack file server by dragging the file to the upload area. Once done, the web UI will indicate success by showing that one file has been uploaded.
Download the file and save it to your home
In order to read this file, we should save it to our home directory. Use a command like the following:
-----------------------------------------
-- Download from file server.
-- run command as DownloadExt.`` where
-- from="public/SogouCS.reduced.tar" and
-- to="/tmp/nlp/sogo";
-- or you can use the command line style.
-----------------------------------------
!saveUploadFileToHome public/SogouCS.reduced.tar /tmp/nlp/sogo;
Check if the file has been created:
!fs -ls /tmp/nlp/sogo;
Well, it has been created successfully:
Found 1 items
-rw-r--r-- 1 allwefantasy admin 1537763850 2019-05-09 16:59 /tmp/nlp/sogo/news_sohusite_xml.dat
Load the XML data
MLSQL stack supports many data sources, including XML, and news_sohusite_xml.dat is in XML format. We can use the load statement to load the data:
-- load data with xml format
load xml.`/tmp/nlp/sogo/news_sohusite_xml.dat` where rowTag="doc" and charset="GBK" as xmlData;
Notice that you can select any statement, execute it, and check whether the result is what you expect.
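For example, to preview the first rows of the loaded table:
select * from xmlData limit 10 as output;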
Extract the label from the URL
The URL looks like this:
http://sports.sohu.com/20070422/n249599819.shtml
We need to extract sports from it, which means the article belongs to the sports category.
select temp.* from (select split(split(url,"/")[2],"\\.")[0] as labelStr,content from xmlData) as temp
where temp.labelStr is not null
as rawData;
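Before indexing, you can quickly check which labels were extracted (a small sanity query; see also the tips in the complete script below):
select distinct(labelStr) from rawData as output;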
The label we extract from the URL is a string, but the RandomForest algorithm requires an integer label. Here we use StringIndex to build the mapping between strings and numbers:
train rawData as StringIndex.`/tmp/nlp/label_mapping` where inputCol="labelStr" and
outputCol="label";
Now we can convert all string labels to integer labels:
predict rawData as StringIndex.`/tmp/nlp/label_mapping` as rawDataWithLabel;
Notice that we need to register this model as a function, because we need to convert the numbers back to strings in the later predict stage. It's easy to do:
register StringIndex.`/tmp/nlp/label_mapping` as convert_label;
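Registering a StringIndex model also exposes the reverse mapping under an _r suffix, here convert_label_r, which we will rely on in the predict stage to turn numbers back into strings. A quick sanity check might look like this (the index 0 is just an illustrative value that is assumed to exist in the mapping):
select convert_label_r(0) as labelStr as output;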
Split the dataset
Sometimes we need to reduce the dataset because we have limited resources. In another scenario, we may need to split the data into train/validate/test sets. Both can be done with the ET RateSampler. In MLSQL, many ETs also have an easier form of use, which we call command line style. Here are the ET style and the command line style:
ET Style:
run xmlData as RateSampler.``
where labelCol="url" and sampleRate="0.9,0.1"
as xmlDataArray;
Command Line Style:
!split rawDataWithLabel by label with "0.9,0.1" named xmlDataArray;
Now we have split the dataset of each category 0.9/0.1. To speed things up, we use the 10% portion only.
select * from xmlDataArray where __split__=1 as miniXmlData;
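The 90% portion sits at position 0, so if you have enough resources you can train on the larger split the same way (largeXmlData is just an illustrative table name):
select * from xmlDataArray where __split__=0 as largeXmlData;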
Save what we have so far (Optional)
save overwrite miniXmlData as parquet.`/tmp/nlp/miniXmlData`;
load parquet.`/tmp/nlp/miniXmlData` as miniXmlData;
This avoids recomputing miniXmlData every time we need it. In production, you may want to use cache (memory and disk) instead; you can use it like this:
!cache miniXmlData script;
You do not need to release it manually; the MLSQL Engine will take care of it.
Use TF/IDF to process content
train miniXmlData as TfIdfInPlace.`/tmp/nlp/tfidf` where inputCol="content" as trainData;
Again, register the model as a function:
register TfIdfInPlace.`/tmp/nlp/tfidf` as tfidf_predict;
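Once registered, the function maps raw text to a tf/idf vector directly. A minimal check, reusing the same Chinese sample that appears in the predict section later:
select tfidf_predict("新闻不错") as feature as output;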
Save what we have so far (Optional)
save overwrite trainData as parquet.`/tmp/nlp/trainData`;
load parquet.`/tmp/nlp/trainData` as trainData;
Again, you can cache the trainData.
Cut the feature size
The feature size generated by the TfIdfInPlace ET is more than 600,000, which slows down training; here we use vec_range to subrange the vector:
select vec_range(content,array(0,10000)) as content,label from trainData as trainData;
There are many vector-related functions in MLSQL; check here if you are interested.
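For instance, vec_argmax (used later in this guide) returns the position of the largest element, and vec_dense, which I assume here from the same vector function family, builds a dense vector from an array:
select vec_argmax(vec_dense(array(0.0,0.9,0.1))) as idx as output;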
Train RandomForest
train trainData as RandomForest.`/tmp/nlp/rf` where
keepVersion="true"
and fitParam.0.featuresCol="content"
and fitParam.0.labelCol="label"
and fitParam.0.maxDepth="4"
and fitParam.0.checkpointInterval="100"
and fitParam.0.numTrees="4"
;
You can use numbered fitParam groups to configure multiple sets of params, like this:
train trainData as RandomForest.`/tmp/nlp/rf` where
keepVersion="true"
and fitParam.0.featuresCol="content"
and fitParam.0.labelCol="label"
and fitParam.0.maxDepth="4"
and fitParam.0.checkpointInterval="100"
and fitParam.0.numTrees="4"
and fitParam.1.featuresCol="content"
and fitParam.1.labelCol="label"
and fitParam.1.maxDepth="3"
and fitParam.1.checkpointInterval="100"
and fitParam.1.numTrees="10"
;
Then MLSQL Engine will generate two models.
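When multiple models are generated, you may want to pick a specific group at register time. My understanding is that an algIndex option selects the group, but treat the parameter name as an assumption and check it against your MLSQL version:
-- algIndex is assumed; verify against your MLSQL version
register RandomForest.`/tmp/nlp/rf` where algIndex="1" as rf_predict;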
Register the model as a function:
register RandomForest.`/tmp/nlp/rf` as rf_predict;
Predict
This is end-to-end prediction; you can also deploy it as an API service.
Do not forget to subrange the tfidf feature:
select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict("新闻不错"),array(0,10000))))) as predicted as output;
As you can see, we use all the functions registered before, which together convert raw text into the final string category. The steps are clear:
- use tfidf_predict to generate the vector
- use vec_range to subrange the vector
- use rf_predict to get the numeric category
- use convert_label_r to convert the number back to a string
Most of the time you will train several times; if you want to see the history, use a command like this:
!model history /tmp/nlp/rf;
How to deploy the API service
Just start the MLSQL Engine in local mode, and then you can post to http://127.0.0.1:9003/model/predict with the following params:
dataType=row
data=[{"content":"新闻不错"}]
sql=select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict(content),array(0,10000))))) as predicted
That's All.
Bonus
Thanks to the include statement and script store support, if you have set up the MLSQL stack, you can use scripts from the store immediately:
include store.`/alg/text_classify.mlsql`;
!textClassify /tmp/nlp/sogo/news_sohusite_xml.dat /tmp/nlp2;
!textPredict "新闻很不错";
MLSQL Engine will download the script from repo.store.mlsql.tech automatically.
Any script you have written can be wrapped as a command and used by others.
The Final Complete Script
-----------------------------------------
-- Download from file server.
-- run command as DownloadExt.`` where
-- from="public/SogouCS.reduced.tar" and
-- to="/tmp/nlp/sogo";
-- or you can use the command line style.
-----------------------------------------
!saveUploadFileToHome public/SogouCS.reduced.tar /tmp/nlp/sogo;
-- load data with xml format
load xml.`/tmp/nlp/sogo/news_sohusite_xml.dat` where rowTag="doc" and charset="GBK" as xmlData;
-- extract `sports` from url [http://sports.sohu.com/20070422/n249599819.shtml]
select temp.* from (select split(split(url,"/")[2],"\\.")[0] as labelStr,content from xmlData) as temp
where temp.labelStr is not null
as rawData;
-- Tips:
----------------------------------------------------------------------------------
-- Try the following SQL to explore how many labels we have and what they look like.
--
-- select distinct(split(split(url,"/")[2],"\\.")[0]) as labelStr from xmlData as output;
-- select split(split(url,"/")[2],"\\.")[0] as labelStr,url from xmlData as output;
----------------------------------------------------------------------------------
-- the label we extract from the url is a string, and the RandomForest algorithm requires
-- an integer label. here we use StringIndex to implement this.
-- train a model which can map labels to numbers and vice versa
train rawData as StringIndex.`/tmp/nlp/label_mapping` where inputCol="labelStr" and
outputCol="label";
-- convert label to number
predict rawData as StringIndex.`/tmp/nlp/label_mapping` as rawDataWithLabel;
-- you can use register to convert a model to a function
register StringIndex.`/tmp/nlp/label_mapping` as convert_label;
-- we can reduce the dataset, because if there is too much data but only limited resources,
-- it may take too long. you can use the command line style
-- or you can use raw ET:
--
-- run xmlData as RateSampler.``
-- where labelCol="url" and sampleRate="0.9,0.1"
-- as xmlDataArray;
!split rawDataWithLabel by label with "0.9,0.1" named xmlDataArray;
-- then we fetch position one of xmlDataArray to get the 10% portion.
select * from xmlDataArray where __split__=1 as miniXmlData;
-- we can save the result data, because recomputing it takes much time.
save overwrite miniXmlData as parquet.`/tmp/nlp/miniXmlData`;
load parquet.`/tmp/nlp/miniXmlData` as miniXmlData;
-- select * from miniXmlData limit 10 as output;
-- convert the content to tf/idf features
train miniXmlData as TfIdfInPlace.`/tmp/nlp/tfidf` where inputCol="content" as trainData;
-- again, register the model as a function
register TfIdfInPlace.`/tmp/nlp/tfidf` as tfidf_predict;
save overwrite trainData as parquet.`/tmp/nlp/trainData`;
load parquet.`/tmp/nlp/trainData` as trainData;
-- the feature size generated by tfidf is > 600,000, which will slow down the performance;
-- here we use vec_range to subrange the vector.
select vec_range(content,array(0,10000)) as content,label from trainData as trainData;
-- use algorithm RandomForest to train
-- you can use numbered fitParam groups to configure multiple sets of params
train trainData as RandomForest.`/tmp/nlp/rf` where
keepVersion="true"
and fitParam.0.featuresCol="content"
and fitParam.0.labelCol="label"
and fitParam.0.maxDepth="4"
and fitParam.0.checkpointInterval="100"
and fitParam.0.numTrees="4"
;
-- register the RF model as a function
register RandomForest.`/tmp/nlp/rf` as rf_predict;
-- end-to-end predict; you can also deploy this as an API service
-- do not forget to subrange the tfidf feature
select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict("新闻不错"),array(0,10000))))) as predicted as output;
-- !model history /tmp/nlp/rf;