Week3 hw2: Draw a Histogram in j

2016-05-30 快要没时间了

In the previous exercise (week 2 homework), we collected all the item info from ganji.com. Now we are going to draw a histogram in Jupyter Notebook with the charts module.

Target

Import a JSON File into MongoDB

Suppose there is a JSON file like this:

[ 
{
"title":"Introduction",
"url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture01.pdf",
"description":""
}
,
{
"title":"Conjugate priors",
"url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture02.pdf",
"description":"T. Griffiths and A. Yuille A primer on probabilistic inference; Chapters 8 and 9 of D. Barber Bayesian Reasoning and Machine Learning. See also this diagram of conjugate prior relationships"
}
]

It can be imported into MongoDB as a collection in two steps.

Create an empty collection (in the mongo shell).

db.createCollection('newCollect')

Use mongoimport (in a terminal).

mongoimport --db databaseName --collection newCollect --file /home/tmp/course_temp.json --jsonArray

It can also be written in short form:

mongoimport -d dbName -c collectName --jsonArray path/file.json
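
The same import can also be done from Python instead of the command line. This is only a minimal sketch: the helper names parse_json_array and import_json_array are hypothetical (they are not part of pymongo), and the collection argument is assumed to be a pymongo Collection, whose insert_many method does the actual work.

```python
import json


def parse_json_array(text):
    """Parse JSON text and make sure the top level is an array of documents."""
    docs = json.loads(text)
    if not isinstance(docs, list):
        raise ValueError("expected a top-level JSON array")
    return docs


def import_json_array(collection, path):
    """Insert every document of the JSON array stored at `path`.

    `collection` is assumed to be a pymongo Collection; any object with an
    insert_many method would work. Returns the number of documents read.
    """
    with open(path, encoding="utf-8") as f:
        docs = parse_json_array(f.read())
    if docs:  # insert_many raises an error for an empty list
        collection.insert_many(docs)
    return len(docs)
```

With a running server, it could be called as import_json_array(client['dbName']['newCollect'], '/home/tmp/course_temp.json'), where client is a pymongo.MongoClient.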

Data Cleaning

Type "jupyter notebook" in a terminal to start the notebook server.

jupyter notebook

Now we can use Jupyter in Safari (localhost:8888).
Here is a sample record. The url value is the obvious field to classify the items by.

{'title': '【图】很新的海信冰箱 - 西城西单二手家电 - 北京58同城', 'price': 260, 'look': '-', 'area': ['西城', '西单'], 'time': 0, '_id': ObjectId('5698f525a98063dbe6e91ca8'), 'cates': ['北京58同城', '北京二手市场', '北京二手家电', '北京二手冰箱'], 'pub_date': '2016.01.13', 'url': 'http://bj.58.com/jiadian/24652878967613x.shtml'}

import pymongo
import charts

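# Connect to the local MongoDB server and select the collection from week 2.
client = pymongo.MongoClient('localhost', 27017)
myDB = client['ganjiDB']
myCollection = myDB['bjGanji']

# Print the category of the first 200 items. Splitting
# 'http://host/category/page' on '/' puts the category at index 3.
for i in myCollection.find().limit(200):
    url = i['url']
    cate = url.split('/')[3]
    print(cate)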
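
Indexing the split result works for the URLs seen here, but it raises IndexError on a URL without a path. As a hedged alternative, the sketch below uses urllib.parse from the standard library; the function name category_of is hypothetical, and returning None for a URL with no path segment is my own choice, not something the homework specifies.

```python
from urllib.parse import urlparse


def category_of(url):
    """Return the first path segment of a listing URL, or None if there is none."""
    path = urlparse(url).path  # e.g. '/jiadian/24652878967613x.shtml'
    segments = [s for s in path.split('/') if s]
    return segments[0] if segments else None


print(category_of('http://bj.58.com/jiadian/24652878967613x.shtml'))
```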

How to classify

Since that works, the rest is much easier.
Can we just use the .find() method to select all items whose url contains a given keyword?
I checked a MongoDB cookbook. The basic comparison operators I found there, such as $gte and $lte, compare whole values and cannot match part of a string; substring matching needs the $regex operator instead.
For now, I use a list and a set to count how often each category occurs.
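
For reference, a keyword query on the url field can be expressed with MongoDB's $regex operator. The sketch below only builds the filter document, so it runs without a server; the helper name category_filter and the exact pattern are assumptions for illustration.

```python
import re


def category_filter(cate):
    """Build a filter matching documents whose url has `cate` as first path segment."""
    # Scheme, host, then the category segment followed by a slash.
    pattern = r'^https?://[^/]+/' + re.escape(cate) + '/'
    return {'url': {'$regex': pattern}}


print(category_filter('jiadian'))
```

Assuming pymongo 3.7 or later, myCollection.count_documents(category_filter('jiadian')) would then return the number of posts in that category.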

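# Collect the category of every item, then deduplicate with a set.
cate_list = []
for each in myCollection.find():
    url = each['url']
    cate = url.split('/')[3]
    cate_list.append(cate)
cate_index = set(cate_list)
print(cate_index)
# Total number of items, number of distinct categories.
print(len(cate_list), len(cate_index))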

The result:

{'yingyou', 'ershoujiaju', 'fushi', 'meirong', 'ershoushebei', 'bangong', 'pingbandiannao', 'tushu', 'tiaozao', 'wenti', 'shouji', 'shuma', 'diannao', 'jiadian', 'bijibendiannao'}
86850 15
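
The list-plus-set approach can also be written with collections.Counter, which produces the per-category counts in a single pass. This is a sketch, not the method used in this homework; count_categories is a hypothetical name, and the sample URLs other than the first are made up for illustration.

```python
from collections import Counter


def count_categories(urls):
    """Count URLs per category, taking the category from index 3 of the split."""
    return Counter(url.split('/')[3] for url in urls)


# Sample input: the first URL comes from the record above, the rest are made up.
sample_urls = [
    'http://bj.58.com/jiadian/24652878967613x.shtml',
    'http://bj.58.com/jiadian/00000000000001x.shtml',
    'http://bj.58.com/shouji/00000000000002x.shtml',
]
counts = count_categories(sample_urls)
print(counts.most_common())
```

Against the real collection, the input could be (doc['url'] for doc in myCollection.find()), and len(counts) would give the number of distinct categories.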

Draw a Histogram

import charts

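series = []
for each in cate_index:
    # One column series per category; 'data' holds that category's post count.
    dat = {
        'name': each,
        'data': [cate_list.count(each)],
        'type': 'column'
    }
    print(dat)
    series.append(dat)

# Chart-level settings; only the title is set here.
options = {
    'title': {'text': 'Post Numbers in each Category'}
}
print(options)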

The result:

{'name': 'yingyou', 'type': 'column', 'data': [7819]}
{'name': 'ershoujiaju', 'type': 'column', 'data': [4891]}
{'name': 'fushi', 'type': 'column', 'data': [9990]}
{'name': 'meirong', 'type': 'column', 'data': [2794]}
{'name': 'ershoushebei', 'type': 'column', 'data': [1639]}
{'name': 'bangong', 'type': 'column', 'data': [6461]}
{'name': 'pingbandiannao', 'type': 'column', 'data': [1525]}
{'name': 'tushu', 'type': 'column', 'data': [4221]}
{'name': 'tiaozao', 'type': 'column', 'data': [1143]}
{'name': 'wenti', 'type': 'column', 'data': [9510]}
{'name': 'shouji', 'type': 'column', 'data': [2822]}
{'name': 'shuma', 'type': 'column', 'data': [7666]}
{'name': 'diannao', 'type': 'column', 'data': [4855]}
{'name': 'jiadian', 'type': 'column', 'data': [18863]}
{'name': 'bijibendiannao', 'type': 'column', 'data': [2651]}
{'title': {'text': 'Post Numbers in each Category'}}

series and options are the two arguments that charts.plot expects; both are structured like JavaScript object literals.

charts.plot(series=series, show='inline', options=options)

[Figure: QQ20160530-1.png, the column chart produced by charts.plot]
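
Because the series are built by iterating over a set, the columns appear in arbitrary order. The sketch below shows one way to build the same series structure sorted by count, largest first; build_series is a hypothetical helper, and the input uses three of the counts printed above purely as sample data.

```python
def build_series(counts):
    """Turn a {category: count} mapping into column-series dicts, largest first."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [{'name': name, 'data': [n], 'type': 'column'} for name, n in ordered]


# Three of the counts from the output above, used only as sample input.
counts = {'tiaozao': 1143, 'jiadian': 18863, 'fushi': 9990}
series = build_series(counts)
for s in series:
    print(s)
```

The resulting list can be passed to charts.plot in the same way as before.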

Appendix

chartsDemo