<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Keith's Web Blog RSS Feed</title>
<language>en-us</language>
<link>http://www.keithwatanabe.net/index.php</link>
<description>Keith Watanabe's Website</description>
<item>
<title>Massive Log Files</title>
<link>http://www.keithwatanabe.net/blogs/2008/3/29/ab18e01982fa54768fb407d42446d015.html</link>
<description><![CDATA[I had my first opportunity to do some work on extremely large log files.  Actually, at work I'm running a batch script that is importing data and I'm quite certain it's still not finished.  The thing is that I'm attempting to do some log analysis using Mysql.  I managed to put one client's logs, which reached over 4 million rows worth of data for a span of nearly three months.  The real test was in my current client where just a few days implied at least a gig's worth of data.<br />
<br />
But through this experience, I learned quite a bit from both the Mysql side and PHP.  First, PHP <strong><em>really</em></strong> needs a threading model.  I cannot emphasize this enough.  Most people probably are happy just to do some multi-tasking version of threads, but that's not &quot;true&quot; threads and you really are hacking a model together.  I've seen fake threading packages here and there, but nothing that warrants trying it out.  I'm starting to see why Java and other languages with threads natively built into them are considered superior in this regard.<br />
<br />
Second, global variables are not necessarily an evil.  What!?!?!??!?!?!  Blasphemer!  Actually, using global variables on a few scenarios saved my butt because passing them by value in a function's method calls inflated my memory usage to double the capacity.  In those cases, I was loading up massive arrays of information where I needed to re-index some information before loading it into the database.  Here's a situation where global variables are actually a GREAT thing because you want to save memory.  The key here is that I know which variables are considered global so I can control from a program flow perspective whether or not these variables would have any side effects in the end.<br />
<br />
On the other hand, I found objects to be evil.  I was using the Zend Framework's models to load part of my data into memory.  When you load a rowset using the Zend Framework, you actually are committing yourself to building up VERY fat objects into memory.  On small data sets, this doesn't really matter.  Most websites on the front end probably don't have to bother with such details.  However, when you're loading several hundred rows, it REALLY matters.  Fortunately, the rowset object allows you to convert the data into pure arrays.  By doing this, I was able to cut memory usage in half.  This was a critical step because I had to bump up the memory allocation to near 2gb and the script was going into swap.<br />
<br />
From the Mysql side, I'm learned a thing or two about indexing.  Most people probably just know about indexes from foreign keys, unique columns or surrogate keys (i.e. sequences).  But indexes can be used in more ways to boost up your queries.  Here, I was creating an equivalent log file table using with month and day columns.  Quite often a person might want to check out the number of unique hits per month, or even for a day.  On a data set of say 10 million rows, this can take roughly 20 seconds.  Putting an index on day or month can cut the time in half.<br />
<br />
One thing I was doing in designing the database was structuring it similar to how dimensional modeling works.  Mostly, just one off join tables (I believe these are called the dimensions) from the fact table which was the log table itself.  I had several large fact tables requiring me to do up to six table joins at a table.  Not deep table joins, but just connecting them to lookup type tables.  I tried creating a view to see if I could optimize the select speed.  Turns out that the view took longer than a regular query.  The reason here is that I would use some filter prior to joining up the tables.  The view had no such filter and would grab the entire data set in a joint format before allowing you to run your filters.<br />
<br />
Overall, the experience has been very cool because the typical highest number of rows I'd work with might be a few hundred thousand.  Here, I'm working with several million, so learning to optimize my code and queries allow me to get a much better feeling for how optimal programming works.]]></description>
<pubDate>Sat, 29 Mar 2008 01:30:15 -0600</pubDate>
<guid>http://www.keithwatanabe.net/blogs/2008/3/29/ab18e01982fa54768fb407d42446d015.html</guid>
</item>
</channel>
</rss>
