Parsing PHP gettext strings with Ruby

I recently needed to extract all gettext strings from an entire project. poEdit can normally do this, but this project uses Smarty with {t}something to translate{/t} blocks. poEdit can’t handle the Smarty syntax, so we need something custom.

Below is the script I use at the moment from top to bottom:

require 'find'

# Read a .po file into a hash of msgid => msgstr.
# Only single-line msgid/msgstr pairs are picked up.
def read_po(file)
  po_hsh = {}
  File.read(file).scan(/msgid\s"(.*)"\s*msgstr\s"(.*)"/) { |id, str| po_hsh[id] = str }
  po_hsh
end

The function above is used later to read the current *.po file into a hash, with the strings to translate as keys and the translated strings as values.
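To make the resulting hash concrete, here is a small sketch that runs the same regex over an in-memory string instead of a file. The sample entries are made up for illustration; a real en.po would hold many more.

```ruby
# Hypothetical sample content standing in for a real .po file.
sample_po = <<~PO
  #path/to/file.php
  msgid "Valda kategorier"
  msgstr "Chosen categories"

  #path/to/other.tpl
  msgid "Spara"
  msgstr ""
PO

po_hsh = {}
# Same pattern as in read_po: capture the msgid and the msgstr that follows it.
sample_po.scan(/msgid\s"(.*)"\s*msgstr\s"(.*)"/) { |id, str| po_hsh[id] = str }

p po_hsh
```

Untranslated entries end up in the hash with an empty string as value, which is what the merge step later relies on.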

# Collect every translatable string under dir, merge in the existing
# translations from po_file, and write the result to new_po_file.
def extract_tree(dir, po_file, new_po_file)
  msgids = {}
  Find.find(dir) do |path|
    next unless File.file?(path) && path =~ /\.(tpl|php)$/
    content = File.read(path)
    # Skip files without any Smarty {t} block or PHP _() call.
    next unless content =~ /\{t\}/ || content =~ /_\(['"]/
    content.scan(/(\{t\})(.*?)(\{\/t\})/) { |_, id, _| msgids[id] = path }
    content.scan(/(_\(['"])(.*?)(['"]\))/) { |_, id, _| msgids[id] = path }
  end
  old_hsh = read_po(po_file)
  out = ""
  wordcount = 0
  msgids.each do |id, path|
    # Unescape already-escaped quotes, then escape every double quote.
    id = id.gsub(/(\\)(["])/, '\2').gsub(/(["])/, '\\\\\1')
    str = old_hsh.fetch(id, "")
    wordcount += id.split.length
    out += '#' + path + "\n"
    out += 'msgid "' + id + '"' + "\n"
    out += 'msgstr "' + str + '"' + "\n\n"
  end
  puts wordcount
  # File.write opens, writes and closes the file in one call.
  File.write(new_po_file, out)
end

extract_tree('src', 'en.po', 'new_en.po')

We begin by parsing the whole project in the same fashion as the .po file above. The string to translate is the key and the path where it was found is the value. If the same string shows up in several files, the last path found wins.
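The tree walk can be tried out in isolation with a throwaway directory. This is only a sketch: the file names and their contents are invented, and the base name is stored instead of the full path to keep the result short.

```ruby
require 'find'
require 'tmpdir'

msgids = {}
Dir.mktmpdir do |dir|
  # Hypothetical project files; only the .tpl and .php ones should be scanned.
  File.write(File.join(dir, 'page.tpl'), '<h1>{t}Valda kategorier{/t}</h1>')
  File.write(File.join(dir, 'save.php'), "<?php echo _('Spara'); ?>")
  File.write(File.join(dir, 'notes.txt'), '{t}ignored{/t}')

  Find.find(dir) do |path|
    next unless File.file?(path) && path =~ /\.(tpl|php)$/
    content = File.read(path)
    name = File.basename(path)
    # Smarty {t}...{/t} blocks, then PHP _('...') calls.
    content.scan(/\{t\}(.*?)\{\/t\}/).flatten.each { |id| msgids[id] = name }
    content.scan(/_\(['"](.*?)['"]\)/).flatten.each { |id| msgids[id] = name }
  end
end

p msgids
```

The string in notes.txt never reaches the hash because the extension check skips the file before it is read.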

Note that we only look at .tpl and .php files, and only scan a file if it contains either {t} or _(. Note also the use of (.*?) to turn off the greediness of the regexes. This is absolutely necessary: with a greedy (.*), a line holding several translation strings would give a single match running from the first opening marker to the last closing one, instead of one match per string. Since . does not match newlines in Ruby by default, a translation string also has to sit on a single line to be picked up.
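The difference between the greedy and the non-greedy pattern is easy to see on a single line with two strings in it. The template snippet below is made up for the comparison.

```ruby
# Hypothetical template line with two translatable strings.
tpl = '<h1>{t}Valda kategorier{/t}</h1><p>{t}Spara{/t}</p>'

greedy = tpl.scan(/\{t\}(.*)\{\/t\}/).flatten    # (.*) runs to the LAST {/t}
lazy   = tpl.scan(/\{t\}(.*?)\{\/t\}/).flatten   # (.*?) stops at the FIRST {/t}

p greedy
p lazy
```

The greedy version returns one oversized match that includes the markup between the two blocks, while the non-greedy version returns the two strings separately.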

After that we load the old translations that need updating into old_hsh. Finally we loop through the result of parsing the project (msgids) and escape each double quote. I noticed that gtranslate does this automatically, so if we don’t escape before looking up what we have already translated in old_hsh, the keys won’t match even though they should.
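The quote handling is two gsub calls in a row: the first strips the backslash from quotes that are already escaped, the second escapes every double quote. A small sketch, with normalize_quotes as a helper name invented here for the demonstration:

```ruby
# Hypothetical helper wrapping the two gsub calls from the script.
def normalize_quotes(id)
  id.gsub(/(\\)(["])/, '\2').gsub(/(["])/, '\\\\\1')
end

from_template = normalize_quotes('Click "OK"')    # quotes not yet escaped
from_php      = normalize_quotes('Click \"OK\"')  # quotes already escaped

puts from_template
puts from_php
puts from_template == from_php
```

Unescaping first is what keeps an already escaped quote from being escaped twice, so both spellings end up as the same key.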

The translated string is taken from the old translation file if the key already exists there. For newly added strings we start with an empty translation, of course.

Each entry in the output will look something like this:

#path/to/file.php
msgid "Valda kategorier"
msgstr "Chosen categories"
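
The merge step that produces entries like this can be sketched on its own. The two hashes below are invented stand-ins for the parsed project and the old .po file.

```ruby
# Hypothetical data standing in for the parsed project and the old .po file.
msgids  = { 'Valda kategorier' => 'path/to/file.php', 'Spara' => 'path/to/other.tpl' }
old_hsh = { 'Valda kategorier' => 'Chosen categories' }

entries = msgids.map do |id, path|
  str = old_hsh.fetch(id, '') # reuse the old translation, or start empty
  [
    '#' + path,
    %(msgid "#{id}"),
    %(msgstr "#{str}")
  ].join("\n")
end

puts entries.join("\n\n")
```

The first entry keeps its existing translation and the second one comes out with an empty msgstr, ready to be filled in.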

