Yelp: Reverse geocoding businesses to extract detailed location information
I’ve been playing around with the Yelp Open Dataset and wanted to extract more detailed location information for each business.
This is an example of the JSON representation of one business:
$ cat dataset/business.json | head -n1 | jq
{
"business_id": "FYWN1wneV18bWNgQjJ2GNg",
"name": "Dental by Design",
"neighborhood": "",
"address": "4855 E Warner Rd, Ste B9",
"city": "Ahwatukee",
"state": "AZ",
"postal_code": "85044",
"latitude": 33.3306902,
"longitude": -111.9785992,
"stars": 4,
"review_count": 22,
"is_open": 1,
"attributes": {
"AcceptsInsurance": true,
"ByAppointmentOnly": true,
"BusinessAcceptsCreditCards": true
},
"categories": [
"Dentists",
"General Dentistry",
"Health & Medical",
"Oral Surgeons",
"Cosmetic Dentists",
"Orthodontists"
],
"hours": {
"Friday": "7:30-17:00",
"Tuesday": "7:30-17:00",
"Thursday": "7:30-17:00",
"Wednesday": "7:30-17:00",
"Monday": "7:30-17:00"
}
}
The businesses reside in different countries so I wanted to extract the area/county/state and the country for each of them. I found the reverse-geocoder library which is perfect for this problem.
You give the library a lat/long or list of lat/longs and it returns you back a list containing the nearest lat/long to your points along with the name of the place, Admin regions, and country code. It’s way quicker to pass in a list of lat/longs than to call the function individually for each lat/long so we’ll do that.
We can write the following code to extract location information for a list of lat/longs:
import reverse_geocoder as rg
lat_longs = {
"FYWN1wneV18bWNgQjJ2GNg": (33.3306902, -111.9785992),
"He-G7vWjzVUysIKrfNbPUQ": (40.2916853, -80.1048999),
"KQPW8lFf1y5BT2MxiSZ3QA": (33.5249025, -112.1153098)
}
business_ids = list(lat_longs.keys())
locations = rg.search(list(lat_longs.values()))
for business_id, location in zip(business_ids, locations):
print(business_id, lat_longs[business_id], location)
This is the output we get from running the script:
$ python blog.py
Loading formatted geocoded file...
FYWN1wneV18bWNgQjJ2GNg (33.3306902, -111.9785992) OrderedDict([('lat', '33.37088'), ('lon', '-111.96292'), ('name', 'Guadalupe'), ('admin1', 'Arizona'), ('admin2', 'Maricopa County'), ('cc', 'US')])
He-G7vWjzVUysIKrfNbPUQ (40.2916853, -80.1048999) OrderedDict([('lat', '40.2909'), ('lon', '-80.10811'), ('name', 'Thompsonville'), ('admin1', 'Pennsylvania'), ('admin2', 'Washington County'), ('cc', 'US')])
KQPW8lFf1y5BT2MxiSZ3QA (33.5249025, -112.1153098) OrderedDict([('lat', '33.53865'), ('lon', '-112.18599'), ('name', 'Glendale'), ('admin1', 'Arizona'), ('admin2', 'Maricopa County'), ('cc', 'US')])
It seems to work fairly well! Now we just need to tweak our script to read in the values from the Yelp JSON file and generate a new JSON file containing the locations:
import json
import reverse_geocoder as rg
lat_longs = {}
with open("dataset/business.json") as business_json:
for line in business_json.readlines():
item = json.loads(line)
if item["latitude"] and item["longitude"]:
lat_longs[item["business_id"]] = {
"lat_long": (item["latitude"], item["longitude"]),
"city": item["city"]
}
result = {}
business_ids = list(lat_longs.keys())
locations = rg.search([value["lat_long"] for value in lat_longs.values()])
for business_id, location in zip(business_ids, locations):
result[business_id] = {
"country": location["cc"],
"name": location["name"],
"admin1": location["admin1"],
"admin2": location["admin2"],
"city": lat_longs[business_id]["city"]
}
with open("dataset/businessLocations.json", "w") as business_locations_json:
json.dump(result, business_locations_json, indent=4, sort_keys=True)
And that’s it!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.