<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Keith's Web Blog RSS Feed</title>
<language>en-us</language>
<link>http://www.keithwatanabe.net/index.php</link>
<description>Keith Watanabe's Website</description>
<item>
<title>Build First, Optimize Later</title>
<link>http://www.keithwatanabe.net/blogs/2008/1/26/4a89957da3e91ac5b85e4421992ed607.html</link>
<description><![CDATA[My application at work met an interesting challenge when one part ran incredibly slowly.  I never bothered benchmarking the thing initially and didn't consider the possibility of a large data set that would hurt performance.  Well, I got hit by <em>that</em> large data set and my application took far too long, looping over a data set of 14000+ items squared (upwards of 196 million iterations).  Naturally, this was AWFUL so I had to really consider how to beef up performance.  Mostly, that meant eliminating the secondary loop, which was causing the redundant iteration.  <br />
<br />
I ended up spending a good day banging my head over the problem.  After a good night's rest, I figured that I needed to start by removing any calls that would be performance intensive, especially ones made on every iteration.  I found various database calls I could pull out and some existing data elements that I could pass into the subroutine, thus improving speed somewhat.  But the main double loop was the killer problem.  Originally, I had written this one particular method to be reusable, in other words good, economical, environmentally friendly programming.  However, it doesn't always pay to be so economical if you don't see the overall impact a situation might have.<br />
<br />
So I ended up writing a specialized routine for that piece of code that takes a previously established data set as a parameter, kinda like a cache effect.  Memory usage didn't matter much, as my application has a limited number of users.  So doing a pass-by-value to the routine mattered little.  After re-integrating this piece of code into the application (with a few more performance enhancements), this little baby was soaring.  Compared to the original application, the boost was so great that my manager swore to rebuild his stuff after seeing the obvious difference in speed for the same data set.<br />
<br />
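The change described above can be sketched roughly as follows.  This is only an illustration under assumptions: the record shape and the function names are hypothetical, since the real code isn't shown here.  The idea is to build a keyed lookup once and hand it to the specialized routine, instead of rescanning the second data set for every item.<br />
<br />
```php
<?php
// Hypothetical record shape: the post does not show the real data.
// Items look like ['id' => int]; others look like ['id' => int, 'value' => string].

// Shape of the original problem: every item scans the whole second set.
function matchWithNestedLoop(array $items, array $others): array
{
    $matches = [];
    foreach ($items as $item) {
        foreach ($others as $other) {
            if ($item['id'] === $other['id']) {
                $matches[$item['id']] = $other['value'];
            }
        }
    }
    return $matches;
}

// Build the lookup once, outside the hot path (the "cache effect").
function buildLookup(array $others): array
{
    $lookup = [];
    foreach ($others as $other) {
        $lookup[$other['id']] = $other['value'];
    }
    return $lookup;
}

// Specialized routine: receives the prebuilt lookup instead of rescanning.
function matchWithLookup(array $items, array $lookup): array
{
    $matches = [];
    foreach ($items as $item) {
        if (array_key_exists($item['id'], $lookup)) {
            $matches[$item['id']] = $lookup[$item['id']];
        }
    }
    return $matches;
}

$items  = [['id' => 1], ['id' => 2], ['id' => 3]];
$others = [
    ['id' => 2, 'value' => 'b'],
    ['id' => 3, 'value' => 'c'],
    ['id' => 4, 'value' => 'd'],
];

$slow = matchWithNestedLoop($items, $others);
$fast = matchWithLookup($items, buildLookup($others));

var_dump($slow === $fast); // bool(true)
```
<br />
With n items on each side, the nested version does about n squared comparisons while the lookup version does about 2n array operations, which is where a speedup of this kind would come from.<br />
<br />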
Anecdote aside, what does my story have to do with the title?  The thing is that I wanted to at least get the application out there.  I'm on a deadline so I need to push things out fast.  I didn't have any assumptions when I first built it, so my initial purpose was to get it to work on some sample data.  More than that, I didn't understand the problem well enough the first time around.  So it was more important to show a workable demo than an error proof, solid, high octane application.  In other words, get it done and use the first pass to learn what you're trying to figure out.  Then, once you understand the problem better, improve on your answer to it.  That's the moral of this story.]]></description>
<pubDate>Sat, 26 Jan 2008 21:30:33 -0700</pubDate>
<guid>http://www.keithwatanabe.net/blogs/2008/1/26/4a89957da3e91ac5b85e4421992ed607.html</guid>
</item>
<item>
<title>PHP Variable Interpolation</title>
<link>http://www.keithwatanabe.net/blogs/2008/1/28/4e5c75bba88c6ee08fe3cdf3d5aaabc6.html</link>
<description><![CDATA[As you progress in a language, you'll probably end up getting bored or, better yet, trying to figure out the limits of the language.  I'm getting to that point with <strong>PHP</strong>.  Along the way, I wanted to find some neat tricks to either improve the performance of my code, or to make my code more compact, even <strong>Perl-like</strong> (i.e. one-liners).  I found a neat little trick you can do in a string that's not very well known.   <br />
<br />
For the longest time, I would concatenate strings together if I wanted to use a method call, or drop the single quotes from hash keys if I used them in a string.  For instance:<br />
<br />
$str = &quot;Hello &quot; . $this-&gt;getFirstName() . &quot; &quot; . $this-&gt;getLastName();<br />
$out = &quot;My name is $arr[fname] $arr[lname]&quot;;<br />
<br />
It's said that both have some performance issues.  In Perl, the second one would be considered decent form, since Perl style often discourages quoting hash keys.  However, in PHP, the second form is said to be a bit slower because of some internal conversions in the interpreter.  Note also that the unquoted key only works inside a double-quoted string; elsewhere PHP treats fname as a constant.  So how can we handle this in a slicker manner?  Use <em>curly braces</em> {}. <br />
<br />
$str = &quot;Hello {$this-&gt;getFirstName()} {$this-&gt;getLastName()}&quot;;<br />
$out = &quot;My name is {$arr['fname']} {$arr['lname']}&quot;;<br />
<br />
Both styles are considered good form in PHP, even if they bulk up your strings a bit.  But you can avoid nasty concatenations or some performance hits by doing this.<br />
<br />
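To check that the styles above really produce the same strings, here is a small runnable sketch.  The Person class is a hypothetical stand-in for whatever object provides getFirstName() and getLastName(); it isn't from any real code base.<br />
<br />
```php
<?php
// Hypothetical class supplying the accessors used in the snippets above.
class Person
{
    private $firstName;
    private $lastName;

    public function __construct(string $firstName, string $lastName)
    {
        $this->firstName = $firstName;
        $this->lastName  = $lastName;
    }

    public function getFirstName(): string
    {
        return $this->firstName;
    }

    public function getLastName(): string
    {
        return $this->lastName;
    }

    // Concatenation style.
    public function greetWithConcat(): string
    {
        return "Hello " . $this->getFirstName() . " " . $this->getLastName();
    }

    // Curly-brace style: method calls are evaluated inside the string.
    public function greetWithBraces(): string
    {
        return "Hello {$this->getFirstName()} {$this->getLastName()}";
    }
}

$arr    = ['fname' => 'Keith', 'lname' => 'Watanabe'];
$person = new Person($arr['fname'], $arr['lname']);

// Unquoted keys are only legal inside a double-quoted string.
$unquoted = "My name is $arr[fname] $arr[lname]";
// Curly braces allow the normal quoted-key syntax.
$braced   = "My name is {$arr['fname']} {$arr['lname']}";

var_dump($person->greetWithConcat() === $person->greetWithBraces()); // bool(true)
var_dump($unquoted === $braced);                                     // bool(true)
```
<br />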
Another little trick for handling variable interpolation, especially if you do a lot of inline messages (which isn't a good thing to do, but sometimes can't be helped), is using the <strong>heredoc</strong> format.  Heredoc is a nice technique for avoiding the string concatenation problem.  It makes the code look a little uglier, but you keep long text readable without forcing the interpreter to work through a chain of concatenations.  The format goes like this:<br />
<br />
$str = &lt;&lt;&lt;EOL<br />
My favorite famous Japanese women are:<br />
1) Reina Miyauchi<br />
2) Norika Fujiwara<br />
3) Shiho<br />
4) Kaori Kawai<br />
5) Ryoko Yonekura<br />
EOL;<br />
<br />
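A heredoc like the one above also interpolates variables with the same curly-brace syntax.  Here is a minimal sketch for a long SQL statement; the table and column names are made up for illustration, and the values are left as named placeholders on the assumption that a prepared statement binds them later.<br />
<br />
```php
<?php
// Hypothetical table and column names, for illustration only.
$table      = 'log_entries';
$columns    = ['id', 'url', 'hit_time'];
$columnList = implode(', ', $columns);

// Identifiers are interpolated; values stay as :placeholders so they can
// be bound by a prepared statement rather than pasted into the SQL text.
$sql = <<<EOL
SELECT {$columnList}
FROM {$table}
WHERE month = :month
  AND day = :day
EOL;

echo $sql, "\n";
```
<br />
The closing EOL sits at the start of its own line, which works on every PHP version that supports heredocs.<br />
<br />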
Perl and other Unix scripting languages provide similar mechanisms.  You can put variables and even make method calls in the heredoc as well.  As for when to use such a device, my recommendation is SQL, when you need to create long statements and lack a nice ORM to handle this internally for you.  If you have to create email messages or insert HTML into your code, you might as well skip right over to a templating system like Smarty instead.  Of course, if your application is small, then a heredoc isn't out of the question either.]]></description>
<pubDate>Mon, 28 Jan 2008 08:12:22 -0700</pubDate>
<guid>http://www.keithwatanabe.net/blogs/2008/1/28/4e5c75bba88c6ee08fe3cdf3d5aaabc6.html</guid>
</item>
<item>
<title>Massive Log Files</title>
<link>http://www.keithwatanabe.net/blogs/2008/3/29/ab18e01982fa54768fb407d42446d015.html</link>
<description><![CDATA[I had my first opportunity to do some work on extremely large log files.  Actually, at work I'm running a batch script that is importing data and I'm quite certain it's still not finished.  The thing is that I'm attempting to do some log analysis using MySQL.  I managed to load one client's logs, which came to over 4 million rows of data spanning nearly three months.  The real test is my current client, where just a few days of logs amount to at least a gig of data.<br />
<br />
But through this experience, I learned quite a bit on both the MySQL side and the PHP side.  First, PHP <strong><em>really</em></strong> needs a threading model.  I cannot emphasize this enough.  Most people are probably happy to fake threads with some form of multi-tasking, but those aren't &quot;true&quot; threads and you really are hacking a model together.  I've seen fake threading packages here and there, but nothing that warrants trying out.  I'm starting to see why Java and other languages with threads natively built in are considered superior in this regard.<br />
<br />
Second, global variables are not necessarily evil.  What!?!?!??!?!?!  Blasphemer!  Actually, using global variables in a few scenarios saved my butt, because passing them by value into function calls doubled my memory usage (PHP copies an array passed by value as soon as the function modifies it).  In those cases, I was loading up massive arrays of information that I needed to re-index before loading them into the database.  Here's a situation where global variables are actually a GREAT thing because you want to save memory.  The key is that I know which variables are global, so I can control from a program flow perspective whether or not these variables would have any side effects in the end.<br />
<br />
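As a rough sketch of the idea, with made-up log rows since the real arrays aren't shown: the first function edits a global array in place, as described above.  The second is a swapped-in alternative, not what I actually did, that gets the same no-copy behavior by taking the array by reference.<br />
<br />
```php
<?php
// Hypothetical log rows; the post does not show the real arrays.
$GLOBALS['logRows'] = [
    ['date' => '2008-03-28', 'url' => '/index.php'],
    ['date' => '2008-03-29', 'url' => '/blogs'],
];

// The approach from the post: reach for the global and edit it in place,
// so no second copy of the big array is created.
function addMonthUsingGlobal(): void
{
    global $logRows;
    for ($i = 0, $n = count($logRows); $i < $n; $i++) {
        $logRows[$i]['month'] = (int) substr($logRows[$i]['date'], 5, 2);
    }
}

// Alternative (an assumption, not the post's method): the & makes the
// function operate on the caller's array instead of a private copy.
function addDayByReference(array &$rows): void
{
    for ($i = 0, $n = count($rows); $i < $n; $i++) {
        $rows[$i]['day'] = (int) substr($rows[$i]['date'], 8, 2);
    }
}

addMonthUsingGlobal();

$rows = [
    ['date' => '2008-03-28', 'url' => '/index.php'],
    ['date' => '2008-03-29', 'url' => '/blogs'],
];
addDayByReference($rows);

print_r($GLOBALS['logRows'][0]);
print_r($rows[1]);
```
<br />
Either way, the trade-off is the one mentioned above: the function now has side effects on data it doesn't own, so you need to know exactly which variables are being touched.<br />
<br />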
On the other hand, I found objects to be evil.  I was using the Zend Framework's models to load part of my data into memory.  When you load a rowset using the Zend Framework, you are actually committing yourself to building up VERY fat objects in memory.  On small data sets, this doesn't really matter.  Most websites on the front end probably don't have to bother with such details.  However, when you're loading several hundred rows, it REALLY matters.  Fortunately, the rowset object allows you to convert the data into plain arrays.  By doing this, I was able to cut memory usage in half.  This was a critical step because I had to bump up the memory allocation to nearly 2 GB and the script was going into swap.<br />
<br />
From the MySQL side, I learned a thing or two about indexing.  Most people probably just know about indexes from foreign keys, unique columns or surrogate keys (i.e. sequences).  But indexes can be used in more ways to speed up your queries.  Here, I was creating an equivalent log file table with month and day columns.  Quite often a person might want to check out the number of unique hits per month, or even for a day.  On a data set of say 10 million rows, this can take roughly 20 seconds.  Putting an index on day or month can cut the time in half.<br />
<br />
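To make that concrete, here is a sketch of the statements involved, kept in heredocs so they can be handed to whatever database layer is in use.  The table, column and index names are hypothetical; the only thing taken from the description above is that the log table has month and day columns.<br />
<br />
```php
<?php
// Hypothetical table, column and index names.
$createIndexes = <<<EOL
CREATE INDEX idx_log_month ON log_entries (month);
CREATE INDEX idx_log_day ON log_entries (day);
EOL;

// Unique hits for one month: the kind of query the index is meant to help.
$uniqueHitsQuery = <<<EOL
SELECT COUNT(DISTINCT ip_address) AS unique_hits
FROM log_entries
WHERE month = :month
EOL;

echo $createIndexes, "\n\n", $uniqueHitsQuery, "\n";
```
<br />
Whether MySQL actually picks up the index for a given query is something EXPLAIN would confirm, so treat the statements as a starting point rather than a guaranteed win.<br />
<br />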
One thing I was doing in designing the database was structuring it similar to how dimensional modeling works.  Mostly, that meant tables just one join away (I believe these are called the dimensions) from the fact table, which was the log table itself.  I had several large fact tables requiring me to do up to six table joins at a time.  Not deep table joins, but just connecting them to lookup type tables.  I tried creating a view to see if I could optimize the select speed.  Turns out that the view took longer than a regular query.  The reason is that in the regular query I would apply some filter before joining up the tables.  The view had no such filter and would grab the entire data set in joined form before allowing you to run your filters.<br />
<br />
Overall, the experience has been very cool because the highest number of rows I'd typically work with might be a few hundred thousand.  Here, I'm working with several million, so learning to optimize my code and queries gives me a much better feel for how optimal programming works.]]></description>
<pubDate>Sat, 29 Mar 2008 01:30:15 -0600</pubDate>
<guid>http://www.keithwatanabe.net/blogs/2008/3/29/ab18e01982fa54768fb407d42446d015.html</guid>
</item>
</channel>
</rss>
