2017-11-14  海棉棉

OpenStreetMap Project Data Wrangling with MongoDB

Wang Xiaochu


Map Area

Birmingham, West Midlands, England

Birmingham is where I studied last year for a master's degree in urban and regional planning, so studying the city from a different perspective with the new skills I've learned means a lot to me.

Problems Encountered in the Map

After initially downloading a small sample of the Birmingham area and running it against a provisional data.py file, I noticed the following main problems with the data, which I discuss in order:

Overabbreviated & Misspelled Street Names

Before the data was imported into MongoDB, it was audited in audit.py using the following functions:

import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Matches the last token of a street name, e.g. "Road" in "Bristol Road".
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
        "Trail", "Parkway", "Commons", "Way", "Walk"]

def audit_street_type(street_types, street_name):
    # Group street names by their last token when it is not an expected type.
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

def is_street_name(elem):
    return elem.attrib['k'] == "addr:street"

def audit(osmfile):
    street_types = defaultdict(set)
    with open(osmfile, "rb") as osm_file:
        # "end" events guarantee that the child <tag> elements are fully parsed;
        # with "start" events they may not be present yet.
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag in ("node", "way"):
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])
    return street_types

This audit revealed street-name abbreviations and misspellings. I corrected the problematic substrings in the affected address strings; for example, "Aveune" becomes "Avenue" and "G4rove" becomes "Grove".

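The street-name correction described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name update_name and the CORRECTIONS table are hypothetical, only the "Aveune" and "G4rove" entries come from the text, and the sketch fixes only the last token of a name rather than every substring.

```python
import re

# Hypothetical correction table. Only the first two entries come from the
# text above; the abbreviation entries are assumed examples.
CORRECTIONS = {
    "Aveune": "Avenue",
    "G4rove": "Grove",
    "Rd": "Road",
    "St": "Street",
}

# Same pattern as street_type_re in the audit: capture the final token.
last_token_re = re.compile(r'\b\S+\.?$')

def update_name(name, corrections=CORRECTIONS):
    """Return name with its last token replaced if it is a known bad form."""
    m = last_token_re.search(name)
    if not m:
        return name
    # Ignore a trailing period so that "Rd." and "Rd" are treated alike.
    fixed = corrections.get(m.group().rstrip("."))
    if fixed is None:
        return name
    return name[:m.start()] + fixed
```

Names whose last token is not in the table are returned unchanged, so the function can be applied to every addr:street value during the export step.
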
Postal Codes

Postcodes turned out to be a non-technical problem. They revealed that some of the data lies outside the city of Birmingham, which forced me to filter out the nodes whose postcodes start with "C" or "D" while exporting the data to a JSON file. The check I added drops any element whose postcode does not start with "B":

if m == 'addr:postcode' and elem.attrib['v'][0] != 'B':
    return None 
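
As a self-contained variation on this check, the sketch below compares the full postcode area (the leading letters) instead of only the first character. The function name in_b_postcode_area is hypothetical, and the stricter comparison is an assumption of this sketch rather than what the project did: it also rejects areas such as "BS" that merely begin with "B".

```python
def in_b_postcode_area(postcode):
    """Return True if the postcode's leading letters are exactly 'B'."""
    cleaned = postcode.strip().upper()
    # Collect the leading letters, e.g. "B" from "B15 2TT", "CV" from "CV1 5FB".
    area = ""
    for ch in cleaned:
        if ch.isalpha():
            area += ch
        else:
            break
    return area == "B"
```

An element would then be skipped during export whenever its addr:postcode value fails this test.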

Data Overview

This section contains basic statistics about the dataset and the MongoDB queries used to gather them.

File Size

birmingham_england.osm ........... 1,535 MB

birmingham_england.osm.json .... 1,780 MB

Additional Ideas

Region statistics and suggestion

Although some nodes outside the Birmingham region have been filtered out by postcode, the file still contains nodes that cannot be classified because they have no postcode attribute. One way to solve this would be to compare each node's longitude and latitude with the extent of the city of Birmingham. Another is to make sure the "addr.city" property is properly filled in. Using the following query, I explored the statistics further:

    > db.birm.aggregate([{"$match":{"addr.city":{"$exists":1}}},
                        {"$group":{"_id":"$addr.city","count":{"$sum":1}}}, 
                        {"$sort":{"count":-1}}])

    { "_id" : "Birmingham", "count" : 3884 }
    { "_id" : "Solihull", "count" : 910 }
    { "_id" : "Bromsgrove", "count" : 238 }
    { "_id" : "Alcester", "count" : 158 }
    { "_id" : "Sutton Coldfield", "count" : 106 }
    { "_id" : "Tipton", "count" : 63 }
    { "_id" : "Madeley", "count" : 49 }
    .....
    { "_id" : "Ironbridge", "count" : 26 }
    { "_id" : "Wolverhampton", "count" : 22 }
    { "_id" : "bm", "count" : 16 }
    { "_id" : "West Bromwich", "count" : 16 }
    { "_id" : "Redditch", "count" : 13 }

It turns out I didn't wrangle this data thoroughly. Ideally the city values would be cleaned further, but these cities all lie within the wider Birmingham region, so the effect is small if the dataset is treated as the Birmingham metropolitan area.
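
The coordinate comparison suggested earlier in this section could look roughly like the sketch below. Everything in it is an assumption for illustration: the bounding-box values are approximate and not taken from the project, the names in_bbox and keep_nodes_in_bbox are hypothetical, and each document is assumed to store its position as a "pos" list of [latitude, longitude].

```python
# Assumed, approximate bounding box for Birmingham in decimal degrees.
# Illustrative values only, not taken from the original project.
BIRMINGHAM_BBOX = {
    "min_lat": 52.38, "max_lat": 52.61,
    "min_lon": -2.04, "max_lon": -1.72,
}

def in_bbox(lat, lon, bbox=BIRMINGHAM_BBOX):
    """Return True if (lat, lon) falls inside the bounding box."""
    return (bbox["min_lat"] <= lat <= bbox["max_lat"]
            and bbox["min_lon"] <= lon <= bbox["max_lon"])

def keep_nodes_in_bbox(docs, bbox=BIRMINGHAM_BBOX):
    """Keep only documents whose 'pos' [lat, lon] pair is inside the box."""
    kept = []
    for doc in docs:
        pos = doc.get("pos")
        # Documents without a usable position are dropped.
        if pos and len(pos) == 2 and in_bbox(pos[0], pos[1], bbox):
            kept.append(doc)
    return kept
```

A rectangular box is only a coarse approximation of a city boundary; a polygon test would be more precise but needs boundary data that this write-up does not include.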

Additional data exploration using MongoDB queries

Conclusion

OpenStreetMap data gave me an ideal opportunity to practice data wrangling and a special way to explore cities. Although this review of the data is cursory, I think it has been cleaned well enough for the purposes of the exercise.
