Scraping and Converting Data with Hpricot and JSON in Ruby


Did you ever wonder how Warren Buffet managed to get that rich? Pure luck my friend, the man was born with perfect timing, one year after the stock market crash of 1929, at the beginning of the great depression.

Warren is a value investor, his specialty is finding companies with depressed valuations and then waiting for – or actively working to get – the rest of the market to realize these values, at which point he can take home handsome profits.

Imagine then how much opportunity there was in the 50’s and 60’s for this kind of investment strategy when the stock market was still hammered. Well guess what, we’ve come full circle, at least in Thailand. The SET topped at about 1600 in 96, today, more than 10 years later we’re at 460, and more than 40 different companies from a wide array of sectors are now valued lower than their booked value, despite being solid. The average Thai company has a P/E of 6.33 at the time of writing. Warren would probably call this really good, possibly even great.

The problem now is actually analyzing all these companies, there are more than 600 of them, just because a P/E is low doesn’t mean investing in said company is a great idea. We have to try and find out if the low P/E is justified or not. In fact Warren himself worked for The Grandmaster in such a capacity.

With that in mind I’d like to have all the fundamentals in my own database so that I could for instance run something like this:

SELECT * FROM companies WHERE pe < 7 AND dividend_yield_percent > 10 AND pbv < 1.5

The result of that query would return really nice candidates if there are any that good. I intend to find out so I need the data, time for some scraping!

#!ruby
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'json'

def parse_company(ticker)
  doc = Hpricot(open("http://www.set.or.th/set/companyhighlight.do?symbol=#{ticker}&language=en&country=US"))
  temp = []
  (doc/"td").each do |td|
    if td.inner_html.strip == "N.A."
      temp << -1
    elsif td.inner_html.to_f != 0.0
      temp << td.inner_html.gsub(/,/, '').to_f
    end
  end
  result = [[], [], [], [], []]
  while temp.length > 0
    result.each_index do |i|
      result[i] << temp.shift
    end
  end
  return result
end

companies = File.new("companies.txt")
companies.each_line do |line|
  ticker = line.strip.split.first
  unless FileTest.file?("#{ticker}.txt")
    data = File.new("#{ticker}.txt", "w+")
    data.write(JSON.generate(parse_company(ticker)))
    data.close
  end
end

I have a file with the tickers of all the companies in the SET in companies.txt, we use it to loop through all companies.

If we already have a file with fundamentals for any given company we don’t do anything, we’re only interested in scraping unknown companies. We use parse_company followed by a call to JSON, we will decode the JSON in PHP at a later stage.

First we begin by opening the given URL and create the Hpricot object with its contents. Next we use the Hpricot object to get at all cells in the whole document. If a cell purely contains “N.A.” or a number we are interested in it because that is the fundamentals we are after. We will also remove commas from the numbers so that we can work with them in the form of floats in MySQL later on.

Next we initiate an array of five arrays, one for each year we have data, in the particular tables we are scraping. The result of the loop is that we get each column in each sub-array, great, just what we want for easy inserting into the database.

Here’s an alternative that isn’t using Hpricot:

require 'rubygems'
require 'open-uri'
require 'json'

def parse_company(ticker)
  str = URI.parse("http://www.set.or.th/set/companyhighlight.do?symbol=#{ticker}&language=en&country=US").read
  temp = []
  str.scan(/(width="100">)([\d.,NA-]+)(<\/td>)/){|a, b, c|
    if b.strip == "N.A."
      temp << -1
    elsif b.to_f != 0.0
      temp << b.gsub(/,/, '').to_f
    end
  }
  result = [[], [], [], [], []]
  result = result[0..(temp.length / 16)]
  while temp.length > 0
    result.each_index do |i|
      result[i] << temp.shift
    end
  end
  return result
end

companies = File.new("companies.txt")
companies.each_line do |line|
  ticker = line.strip.split.first
  unless FileTest.file?("results/#{ticker}.txt")
    data = File.new("results/#{ticker}.txt", "w+")
    data.write(JSON.pretty_generate(parse_company(ticker)))
    data.close
  end
end

Update: I just ran this SQL:

SELECT DISTINCT c.* FROM fundamentals f, company c WHERE f.pe < 8 AND f.dividend > 6 AND f.company = c.id AND f.year = '2010-06-01'

With this result, it’s the top of 103 hits:

screenshot.png

Oh you want the rest of it? C’mon give me a break here, I’m not a charity.

Another useful snippet:

SELECT DISTINCT c.name, c.ticker, f.pe
FROM company c, fundamentals f
WHERE c.ticker
IN (
'2S', 'CHOTI', 'GFM', 'TLUXE', 'TMD', 'TPAC', 'UPF'
)
AND f.company = c.id
AND f.year = '2010-06-01'
ORDER BY f.pe DESC

And another one for half year/Q2 results:

SELECT DISTINCT c.ticker, (f.cap / ( f.nprofit *2 )) AS pe FROM fundamentals f, company c
WHERE f.year = '2006-06-01' AND pe < 7 AND pe > 0
AND f.company = c.id ORDER BY pe ASC

Double checking a company:

SELECT DISTINCT *
FROM fundamentals f, company c
WHERE f.year = '2006-06-01'
AND c.ticker = 'GFM'
AND f.company = c.id

Getting a list of pe’s for select companies:

SELECT DISTINCT c.ticker, (f.cap / ( f.nprofit *2 )) AS pe FROM fundamentals f, company c
WHERE f.year = '2006-06-01' AND c.ticker IN('CHOTI','GFM','PDI','PL','STA','STPI','TLUXE','TMD','UPF')
AND f.company = c.id ORDER BY pe ASC

Having added Swedish companies to the database I can now do for instance this:

SELECT DISTINCT c.name, f.pe, f.dividend
FROM fundamentals f, company c
WHERE f.pe <15
AND f.dividend >2
AND f.company = c.id
AND f.year = '2010-12-31'
AND c.country_code = 'SE'
ORDER BY f.pe

An example with comments and tags for the first quarter:

SELECT DISTINCT c.ticker, c.tag, c.comments, (f.cap / ( f.nprofit * 4 )) AS pe FROM fundamentals f, company c
WHERE f.year = '2011-03-30' AND pe < 7 AND pe > 0
AND f.company = c.id ORDER BY pe ASC
SELECT DISTINCT c.ticker, c.tag, c.comments, f.pe 
FROM fundamentals f, company c
WHERE f.year = '2012-03-21' AND f.pe < 12 AND f.pe > 0
AND f.company = c.id ORDER BY f.pe ASC


Related Posts

Tags: , , ,