Will Pettengill

Automated Web-Scraping for Real Estate Data with AWS, Git & Python

Craigslist offers a cornucopia of data on all sorts of markets, one of the richest of which is real estate (of both the residential and commercial variety). That data, if tracked effectively over time, could provide valuable insights about market rates, average sale cycles, or the rate of increase in valuation of commercial real estate in particular neighborhoods and cities. In this project, I extend an existing Python library (craigslist_scraper) to capture the most relevant data in a given commercial real estate post and write it to an AWS cloud database (RDS). Using a crontab on an AWS EC2 server, I can automate the scraper to run every day on new posts.

Extending Python’s craigslist_scraper library

I found this library on Stack Overflow, and it is simple and functional. Calling scraper.scrape_url(url) on a post's URL returns an object with attributes corresponding to various features of the post. The library is built on top of BeautifulSoup4, and the object returned by scrape_url contains a .soup attribute (as well as .title, .price, and a few others), which I used to extend the library. IPython's tab completion can tell you all of the attributes of an object:

In [4]: sc_object = scraper.scrape_url(urls[0])

In [5]: sc_object.<TAB>
sc_object.attrs        sc_object.parse_attr    sc_object.parse_string    sc_object.soup
sc_object.get_text     sc_object.parse_int     sc_object.price           sc_object.title
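A quick aside: the urls list in that session is never defined in the post. One way to build it from a Craigslist search-results page is sketched below; the /search/off path (office & commercial listings) and the hdrlnk link class are assumptions about Craigslist's markup at the time rather than anything the library provides.

import requests
from bs4 import BeautifulSoup

# Assumed starting point: the New York office & commercial search page.
SEARCH_URL = 'https://newyork.craigslist.org/search/off'

def GetPostUrls(search_url=SEARCH_URL):
    """Collect individual posting URLs from one search-results page."""
    page = requests.get(search_url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Result links carried the 'hdrlnk' class at the time; adjust if the markup changes.
    return [a.get('href') for a in soup.findAll('a', attrs={'class': 'hdrlnk'})]

urls = GetPostUrls()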

As the attribute list shows, the object exposes title and price, but a lot of data that would be specifically valuable for real estate analysis is missed. In particular, there are patterns in the construction of the pages that let us pull out the address, longitude, latitude, total square footage, the full title of the posting, and the posting's neighborhood, and all of that data is valuable for our purposes. My AddMeta function below uses various bs4 methods to parse the soup and attach that data to the object.


def AddMeta(x):
    """Attach address, coordinates, square footage, full title and neighborhood to a scraped post."""
    if not x.soup:
        return
    # Street address, shown when the post includes a map.
    try:
        x.address = x.soup.findAll(attrs={'class': 'mapaddress'})[0].text
    except Exception:
        x.address = None
    # Coordinates live in data attributes on the map div.
    try:
        x.latitude = float(x.soup.findAll(attrs={'class': 'mapbox'})[0].findChildren()[0].get('data-latitude'))
    except Exception:
        x.latitude = None
    try:
        x.longitude = float(x.soup.findAll(attrs={'class': 'mapbox'})[0].findChildren()[0].get('data-longitude'))
    except Exception:
        x.longitude = None
    # The 'housing' span reads something like '/ 1000ft2 - '; strip the decoration and keep the number.
    try:
        x.sqfeet = int(x.soup.findAll(attrs={'class': 'housing'})[0].text.replace('/','').replace('ft2','').replace('-','').strip())
    except Exception:
        x.sqfeet = None
    # Full title, minus the 'hide this posting' boilerplate Craigslist appends.
    try:
        x.fulltitle = x.soup.findAll(attrs={'class': 'postingtitletext'})[0].text.replace(' \n\n\nhide this posting\n\n\n\n    unhide\n  \n\n','').replace('\n','')
    except Exception:
        x.fulltitle = None
    # The neighborhood appears in parentheses at the end of the full title.
    try:
        x.neighborhood = x.fulltitle[x.fulltitle.find('(')+1:x.fulltitle.find(')')]
    except Exception:
        x.neighborhood = None
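
The post never shows the step that writes these fields into RDS, but a minimal sketch with psycopg2 might look like the following. The re_data table name and its columns are inferred from the queries further down, the connection details mirror the psql command below, and WritePost itself is a hypothetical helper rather than something from the original script.

import datetime
import psycopg2

# Connection details mirror the psql command shown later in the post.
conn = psycopg2.connect(
    host='craigslistdb.cnc0ky2ic2hk.us-east-1.rds.amazonaws.com',
    port=5432,
    user='wpettengill',
    password='wpettengill',
    dbname='scraper',
)

def WritePost(x, conn=conn):
    """Insert one scraped posting into re_data (table and column names assumed)."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            insert into re_data
                (date, address, latitude, longitude, sqfeet, price, fulltitle, neighborhood)
            values (%s, %s, %s, %s, %s, %s, %s, %s)
            """,
            (datetime.date.today(), x.address, x.latitude, x.longitude,
             x.sqfeet, x.price, x.fulltitle, x.neighborhood),
        )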

Cloud databases and automation with AWS

The free tier of AWS allows one to provision a cloud server instance (EC2) and a database instance (RDS; I use Postgres by default). The administration of these services isn't horribly complicated, but I'll skip that part for this posting. Once a Linux server is provisioned and I have ssh'ed in for the first time, the following commands set up a virtual environment, install git, and clone my repository onto the server:


virtualenv venv
source venv/bin/activate
sudo yum install git
ssh-keygen
cat /home/ec2-user/.ssh/id_rsa.pub
git clone git@github.com:willpettengill/real-estate-craig.git

The public key printed by the cat command should be added to one's GitHub profile as an SSH key so the clone over ssh works. Now, open a crontab with crontab -e and schedule a job to run the Python script daily at 2 am with

0 2 * * * python /home/ec2-user/real-estate-craig/cl_scraper.py

and watch in amazement as the script populates the database over the course of the day (scrape slowly so that Craigslist doesn't ban your IP!).
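
For reference, the loop inside cl_scraper.py isn't shown in the post. Here's a hedged sketch of what cron ends up running, reusing AddMeta from above along with the hypothetical GetPostUrls and WritePost helpers sketched earlier (scraper is the craigslist_scraper module from the IPython session, and the ten-second delay is an arbitrary choice):

import time

def RunDailyScrape(urls, delay_seconds=10):
    """Scrape each posting with a pause between requests."""
    for url in urls:
        post = scraper.scrape_url(url)   # craigslist_scraper, as in the IPython session
        AddMeta(post)                    # attach address, coordinates, sqfeet, title, neighborhood
        WritePost(post)                  # hypothetical insert helper sketched above
        time.sleep(delay_seconds)        # throttle so Craigslist doesn't ban the IP

if __name__ == '__main__':
    RunDailyScrape(GetPostUrls())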

Connect & Query Database

Connect to the AWS database with the following command (password, when prompted, is wpettengill):

psql --host=craigslistdb.cnc0ky2ic2hk.us-east-1.rds.amazonaws.com --port=5432 --username=wpettengill --password --dbname=scraper

Write a query in psql like the following:

Select date, count(*) from re_data group by 1;

    date    | count
------------+-------
 2017-05-27 |   431
 2017-05-18 |   322
 2017-05-20 |   430
 2017-05-21 |   431
 2017-05-29 |   430
 2017-05-26 |   430
 2017-05-19 |   170
 2017-05-23 |   430
 2017-05-24 |   431
 2017-05-25 |   430
 2017-05-30 |   257
 2017-05-22 |   430
 2017-05-28 |   430

Cool - even this simple query gives us some baseline insight about the frequency and volatility of real estate postings in New York. Here's another example, showing the 10 most popular neighborhoods and their associated price per square foot. (I'll footnote that more sophisticated scraping is probably necessary to get reliable price-per-square-foot data.)

select neighborhood, count(*) as total_listings, sum(price)::float / sum(sqfeet)::float as price_per_sq_foot from re_data group by 1 order by 2 desc limit 10;

    neighborhood    | total_listings | price_per_sq_foot
--------------------+----------------+-------------------
 Midtown West       |            817 |  4.44865584715869
 Chelsea            |            457 |  5.28321017993301
 Midtown            |            442 |  4.23941576645336
 Midtown East       |            428 |  5.17031487908353
 Financial District |            271 |  3.59197729316897
 Flatiron           |            193 |  5.08019272421872
 Upper East Side    |            170 |  6.61464371544337
 Lower East Side    |            164 |  4.78085423035374
 SoHo               |            146 |  4.95244297942474
 Bushwick           |            123 |  6.70587688900004