Automated Web-Scraping for Real Estate Data with AWS, Git & Python
Craigslist offers a cornucopia of data on all sorts of markets, one of the richest of which is real estate (of both the residential and commercial variety). That data, if tracked effectively over time, could provide valuable insights about market rates, average sale cycles, or the rate of increase in valuation of commercial real estate in particular neighborhoods and cities. In this project, I extend an existing Python library (craigslist_scraper) to capture the most relevant data in a given commercial real estate post and write it to an AWS cloud database (RDS). Using a crontab on an AWS EC2 server, I automate the scraper to run every day on new posts.
Extending Python’s craigslist_scraper library
I found this library on Stack Overflow, and it is simple and functional. Calling scraper.scrape_url(url) on a post's URL returns an object with attributes corresponding to various features of the post. The library is built on top of BeautifulSoup4, and the object returned by scrape_url contains a .soup attribute (as well as .title, .price, and a few others), which I used to extend the library. IPython's tab completion can list all of the attributes of the object:
In [4]: sc_object = scraper.scrape_url(urls[0])

In [5]: sc_object.
sc_object.attrs         sc_object.parse_attr    sc_object.parse_string  sc_object.soup
sc_object.get_text      sc_object.parse_int     sc_object.price         sc_object.title
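As a quick illustration of how this is used, here is a minimal sketch; the import path and the URL are placeholders of my own (the library may be packaged differently), and the output depends on the post:

# Sketch only: the import path and URL below are assumptions, not taken from the repo
from craigslist_scraper import scraper

url = 'https://newyork.craigslist.org/mnh/off/1234567890.html'  # placeholder post URL
post = scraper.scrape_url(url)

# .soup is a BeautifulSoup object holding the post's full HTML, so anything on the
# page that the library does not parse out can still be reached directly:
map_tags = post.soup.findAll(attrs={'class': 'mapaddress'})
print(map_tags[0].text if map_tags else 'no address on this post')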
As you can see below, while the object already exposes a title and a price, a lot of data that would be especially valuable for real estate analysis is missed. Specifically, there are patterns in the construction of the webpages that allow us to identify the address, longitude, latitude, total square footage, full title, and neighborhood of each posting, all of which is valuable for our purposes. My AddMeta function uses various bs4 methods to parse the soup and attach that data to the object:
def AddMeta(x):
    """Attach address, lat/lon, square footage, full title, and neighborhood
    parsed from the post's soup to the scraped object x."""
    if not x.soup:
        return
    try:
        x.address = x.soup.findAll(attrs={'class': 'mapaddress'})[0].text
    except Exception:
        x.address = None
    try:
        # The map div carries the coordinates as data attributes
        x.latitude = float(x.soup.findAll(attrs={'class': 'mapbox'})[0].findChildren()[0].get('data-latitude'))
    except Exception:
        x.latitude = None
    try:
        x.longitude = float(x.soup.findAll(attrs={'class': 'mapbox'})[0].findChildren()[0].get('data-longitude'))
    except Exception:
        x.longitude = None
    try:
        # Strip the "/ 1000ft2 -" formatting down to a bare number
        x.sqfeet = int(x.soup.findAll(attrs={'class': 'housing'})[0].text.replace('/', '').replace('ft2', '').replace('-', '').strip())
    except Exception:
        x.sqfeet = None
    try:
        # Drop the "hide this posting" boilerplate and newlines from the title block
        x.fulltitle = x.soup.findAll(attrs={'class': 'postingtitletext'})[0].text.replace(' \n\n\nhide this posting\n\n\n\n unhide\n \n\n', '').replace('\n', '')
    except Exception:
        x.fulltitle = None
    try:
        # The neighborhood appears in parentheses at the end of the full title
        x.neighborhood = x.fulltitle[x.fulltitle.find('(') + 1:x.fulltitle.find(')')]
    except Exception:
        x.neighborhood = None
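Applied to the object from the IPython session above, AddMeta simply hangs the new fields off the object in place; the printout here is illustrative, not real output:

AddMeta(sc_object)

# Each attribute is None when the corresponding element is missing from the post
print(sc_object.address, sc_object.latitude, sc_object.longitude)
print(sc_object.sqfeet, sc_object.neighborhood)
print(sc_object.fulltitle)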
Cloud databases and automation with AWS
The free tier of AWS allows you to provision a cloud server instance (EC2) and a database instance (RDS; I use Postgres by default). Administering these services isn't horribly complicated, but I'll skip that part for this post. Once a Linux server is provisioned and I have SSH'ed in for the first time, the following commands set up a virtual environment, install git, generate an SSH key, and clone my repository onto the server:
virtualenv venv
source venv/bin/activate
sudo yum install git
ssh-keygen
cat /home/ec2-user/.ssh/id_rsa.pub
git clone git@github.com:willpettengill/real-estate-craig.git
The public key generated by ssh-keygen (printed by the cat command above) should be added to your GitHub account under SSH keys. Now, open a crontab with crontab -e and schedule a job to run the Python script (using the virtualenv's Python) daily at 2 am with
0 2 * * * /home/ec2-user/venv/bin/python /home/ec2-user/real-estate-craig/cl_scraper.py
and watch in amazement as the script populates the database over the course of the day (scrape slowly so that Craigslist doesn't ban your IP!).
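For reference, here is a minimal sketch of what a daily script like cl_scraper.py might look like; I have not reproduced the repository's actual code. The use of psycopg2, the way the URLs are gathered, and the re_data columns beyond date, price, sqfeet, and neighborhood are all assumptions, and the connection parameters simply mirror the psql command in the next section.

import time
from datetime import date

import psycopg2  # assumed Postgres driver; the repo may use something else

from craigslist_scraper import scraper  # assumed import path, as above

# Connection details mirror the psql command shown below
conn = psycopg2.connect(
    host='craigslistdb.cnc0ky2ic2hk.us-east-1.rds.amazonaws.com',
    port=5432,
    user='wpettengill',
    password='wpettengill',
    dbname='scraper',
)
cur = conn.cursor()

urls = []  # fill with the day's post URLs, gathered from the search results pages

for url in urls:
    post = scraper.scrape_url(url)
    AddMeta(post)  # attach address, lat/lon, sqfeet, fulltitle, neighborhood (defined above)
    cur.execute(
        """
        insert into re_data
            (date, price, sqfeet, neighborhood, address, latitude, longitude, fulltitle)
        values (%s, %s, %s, %s, %s, %s, %s, %s)
        """,
        (date.today(), post.price, post.sqfeet, post.neighborhood,
         post.address, post.latitude, post.longitude, post.fulltitle),
    )
    conn.commit()
    time.sleep(30)  # scrape slowly so Craigslist doesn't ban the IP

cur.close()
conn.close()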
Connect & Query Database
Connect to the AWS database with the following command (password, when prompted, is wpettengill):
psql --host=craigslistdb.cnc0ky2ic2hk.us-east-1.rds.amazonaws.com --port=5432 --username=wpettengill --password --dbname=scraper
Write a query in psql like the following:
Select date, count(*) from re_data group by 1;
| date | count |
|---|---|
| 2017-05-27 | 431 |
| 2017-05-18 | 322 |
| 2017-05-20 | 430 |
| 2017-05-21 | 431 |
| 2017-05-29 | 430 |
| 2017-05-26 | 430 |
| 2017-05-19 | 170 |
| 2017-05-23 | 430 |
| 2017-05-24 | 431 |
| 2017-05-25 | 430 |
| 2017-05-30 | 257 |
| 2017-05-22 | 430 |
| 2017-05-28 | 430 |
Cool - even this simple query gives us some baseline insight into the frequency and volatility of real estate postings in New York. Here's another example, showing the 10 most popular neighborhoods and their associated price per square foot. (I'll footnote that more sophisticated scraping is probably necessary to get reliable price-per-square-foot data.)
select neighborhood, count(*) as total_listings, sum(price)::float / sum(sqfeet)::float as price_per_sq_foot from re_data group by 1 order by 2 desc limit 10;
| neighborhood | total_listings | price_per_sq_foot |
|---|---|---|
| Midtown West | 817 | 4.44865584715869 |
| Chelsea | 457 | 5.28321017993301 |
| Midtown | 442 | 4.23941576645336 |
| Midtown East | 428 | 5.17031487908353 |
| Financial District | 271 | 3.59197729316897 |
| Flatiron | 193 | 5.08019272421872 |
| Upper East Side | 170 | 6.61464371544337 |
| Lower East Side | 164 | 4.78085423035374 |
| SoHo | 146 | 4.95244297942474 |
| Bushwick | 123 | 6.70587688900004 |