
PHP Help

Posted by HostRefugee-Vince, 03-12-2007, 09:26 PM
Hi Everyone, I know a little PHP, but not enough, obviously. I need to do something which I think should be quite simple, but I want it to be as light on server resources as possible. I have one site that gets over 100,000 hits per day, and I want to grab from the raw Apache logs the last 10 accesses to PHP pages, along with the referrer for each of those pages. The Apache log for this site can get quite large, so opening the entire file with PHP is not what I want to do. If I could load just the last 100 lines from the log, there should be at least 10 PHP pages in the results. The script needs to ignore accesses to images and everything else except .php pages.

Here is a sample line from the raw log:

xx.xxx.xxx.xxx - - [12/Mar/2007:03:43:07 -0500] "GET /america/mypage.php HTTP/1.1" 200 20107 "http://www.google.com/search?hl=en&q=search+keywords+here" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts)"

The output in the browser should be simple, just one result per line, for the last 10 PHP page accesses. Sample of the desired output:

/america/mypage.php - http://www.google.com/search?hl=en&q...+keywords+here

Thanks for any input or help you can give.
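A minimal pure-PHP sketch of one way to do this is shown below: read only the tail of the file rather than the whole log, then filter for .php requests. The log path, the 64 KB read size and the regular expression are assumptions based on the sample line above, not tested against the real log.

<?php
// Sketch only: read roughly the last 64 KB of the access log instead of
// loading the whole file, then pick out the last 10 hits on .php pages.
$logfile = '/usr/local/apache/domlogs/mydomain.com'; // assumed log path
$wanted  = 10;      // number of PHP page hits to display
$chunk   = 65536;   // bytes to read from the end of the file

$fh = fopen($logfile, 'r');
fseek($fh, max(0, filesize($logfile) - $chunk));
$lines = explode("\n", fread($fh, $chunk));
fclose($fh);

$hits = array();
foreach (array_reverse($lines) as $line) {
    // Match: "GET /page.php... HTTP/1.x" status size "referrer"
    if (preg_match('#"(?:GET|POST) (\S+\.php\S*) HTTP/[^"]*" \d+ \S+ "([^"]*)"#', $line, $m)) {
        $hits[] = $m[1] . ' - ' . $m[2];
        if (count($hits) >= $wanted) break;
    }
}
echo implode("<br>\n", array_reverse($hits));
?>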

Posted by Archbob, 03-12-2007, 10:47 PM
Why don't you just use the grep command in Unix, or Perl, instead of PHP?

Posted by HostRefugee-Vince, 03-12-2007, 11:03 PM
I'm actually working on that type of solution now. Since I have your attention, this is what I've got so far:

tail -n100 mydomain.com | awk '{print $7}' > output.txt
cat output.txt | grep php > output1.txt
cat output1.txt | grep -v contact > output2.txt
cat output2.txt | grep -v sitemap > output3.txt

The above only gets the accessed files and not the referrers, and it's four different commands I have to run. The reason I was thinking PHP is so I can access it from the browser.
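For what it's worth, the four commands can be collapsed into a single pipeline and run from a PHP page, which keeps the browser-access requirement. A rough sketch, assuming the standard combined log format (where $7 is the request path and $11 is the referrer) and a placeholder log path:

<?php
// Sketch: one pipeline instead of four commands, echoed to the browser.
// Field numbers ($7 = request, $11 = referrer) assume the combined format.
$log = '/usr/local/apache/domlogs/mydomain.com'; // assumed log path

// awk: keep .php requests, drop the two noisy scripts, strip the quotes
// around the referrer field, print "path - referrer".
$awk = '$7 ~ /\.php/ && $7 !~ /contact|sitemap/ {gsub(/"/,"",$11); print $7 " - " $11}';

$cmd = 'tail -n 100 ' . escapeshellarg($log) . " | awk '" . $awk . "' | tail -n 10";
echo nl2br(htmlspecialchars(shell_exec($cmd)));
?>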

Posted by ThatScriptGuy, 03-12-2007, 11:49 PM
Or you could run the script above and, as the last command, have it copy *.txt to the web root, then just access the text file.

Posted by isurus, 03-13-2007, 04:52 AM
Haven't got much time atm, I'll tidy this up later - there are some performance improvements that can be made too.

Posted by tiamak, 03-13-2007, 04:52 AM
Simply run a bash command. Well, you will have to rewrite it a little to fit your access_log; it's written for an access log with a structure where $8 is the accessed file and $12 is the referrer. And by the way, it is extremely fast: it outputs the result in less than a second for a 5,413,499-line access_log.

Posted by isurus, 03-13-2007, 09:14 AM
I set up a 500,000-line dummy log file (106 MB) and did a few timings on my workstation (P4 3GHz, 1.5GB RAM). Using tail to grab the last few hundred lines of data: execution time varies between 0.019s and 0.108s for one version and between 0.052s and 0.097s for the other. Parsing the entire file: execution time is ~2.75s for one version and ~42.0s for the other. Whilst the tail/grep version is faster on average, I would go for the tail/awk version - it's a lot clearer, and hence easier to maintain.
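The exact commands being timed aren't shown above, but the two tail-based variants being compared were presumably something along these lines. These are hypothetical reconstructions, not the originals, written as strings a PHP page could hand to shell_exec; 'access_log' is a placeholder path.

<?php
// Hypothetical reconstructions of the two tail-based variants compared
// above; the real benchmark commands are not shown in the thread.

// tail/grep version: grep filters the lines, awk only prints the fields.
$via_grep = 'tail -n 200 access_log | grep "\\.php" | awk \'{print $7 " - " $11}\' | tail -n 10';

// tail/awk version: awk does both the filtering and the printing.
$via_awk = 'tail -n 200 access_log | awk \'$7 ~ /\\.php/ {print $7 " - " $11}\' | tail -n 10';

echo nl2br(htmlspecialchars(shell_exec($via_awk)));
?>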

Posted by isurus, 03-13-2007, 09:34 AM
Oops - that should've been Execution time is 13s.

Posted by HostRefugee-Vince, 03-13-2007, 10:06 AM
Actually, the original code you posted works for me; the above doesn't. It works great. The only other thing I need this to do is to not count 2 PHP scripts. The 2 I used in my example post above were contact.php and sitemap.php. Those are not the actual files; the actual files are referrer.php and access.php. Both commonly appear in the log and usually have variables attached, for example: referrer.php?referrer=google.com&something=somethingelse&etc=etc. The reason I want to exclude the 2 files is that no matter what page is visited within the site, they will always be accessed as well. In other words, 5 of the 10 results are going to be one of those 2 files. Thanks for everyone's help so far.
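If the pipeline sketch from earlier in the thread is the route taken, the only change needed is in the awk filter. A hedged variant with the two real script names swapped in (same assumed log path and field numbers as before):

<?php
// Same pipeline as the earlier sketch, but excluding the two scripts that
// get hit on every page view (referrer.php and access.php, per the post above).
$log = '/usr/local/apache/domlogs/mydomain.com'; // assumed log path
$awk = '$7 ~ /\.php/ && $7 !~ /referrer\.php|access\.php/ {gsub(/"/,"",$11); print $7 " - " $11}';
$cmd = 'tail -n 100 ' . escapeshellarg($log) . " | awk '" . $awk . "' | tail -n 10";
echo nl2br(htmlspecialchars(shell_exec($cmd)));
?>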

Posted by MasterGee79, 03-13-2007, 12:32 PM
Would it be that much more server-intensive to just create a new log file that stores the last 10 using PHP? Why do you have to use the Apache log?

Posted by HostRefugee-Vince, 03-13-2007, 12:48 PM
I guess it's time to give a little more background. The site in question is a co-op between a few different people... and everybody but me likes the way stats are currently done. The person who designed most of the site made it so every access to every PHP page (except a few) inserts a row into MySQL. This row includes the page visited, the referrer and a few other pieces of information. Once a night, the table is emptied and a new day starts. Personally, AWStats gives me more than enough statistics. The other people in the co-op are not willing to part with the current way of doing things, though, unless I give them a similar solution. Anyhow, that's over 100,000 INSERTs per day... which to me seems a waste of server resources.

The site is also 100% PHP/MySQL generated. The page content rarely changes (maybe one page per day changes just a little). I would like to change this so the pages that visitors access are cached. Again, it just seems like a major waste of resources to run SELECT statements on every page access.

What I would like to do is, of course, come up with a solution for the stats (that everyone but me loves), and also find a way to make cached versions of the pages visitors visit. Here's my proposed solution: when mypage.php is accessed, it checks the date of a file /cached/mypage.html. If mypage.html is older than 30 minutes, it recreates the page cache from PHP/MySQL. If not, it echoes the page cache. I guess I could also have that PHP page write to a file: last10accessed.html.

In my opinion, the above way of doing things would be much less resource intensive... and would save the site from doing 100k SELECTs and 100k INSERTs per day. Maybe I was approaching this the wrong way from the start... Perhaps the proposed solution would be better than grepping the access_logs. Thoughts?
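A bare-bones sketch of that mtime-based cache check is below. The cache path, the 30-minute window and the page-building function are placeholders, not anything from the actual site:

<?php
// Sketch of the proposed cache: serve /cached/mypage.html if it is less
// than 30 minutes old, otherwise rebuild it from PHP/MySQL and save it.
$cache_file = $_SERVER['DOCUMENT_ROOT'] . '/cached/mypage.html'; // assumed location
$max_age    = 30 * 60; // 30 minutes, in seconds

if (file_exists($cache_file) && (time() - filemtime($cache_file)) < $max_age) {
    // Cache is fresh: just echo it, no SELECTs needed.
    readfile($cache_file);
} else {
    // Cache is stale or missing: rebuild the page the usual way.
    $html = build_page_from_mysql(); // placeholder for the existing page code
    file_put_contents($cache_file, $html);
    echo $html;
}
?>

The same block could also append the page and referrer to last10accessed.html, which would cover the stats requirement without touching MySQL or the Apache logs.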

Posted by MasterGee79, 03-13-2007, 03:45 PM
But isn't that what mysql is for? I think it would be just as efficient, if not more, to write to the database than to grep/write to a text file. I may be wrong, or misunderstanding what you are trying to do. These are pretty simple queries. I don't have the benchmarks to prove it, but I would assume that mysql was built to handle something like this with ease, otherwise we would still be using txt files to write/read info. If you are having performance issues with your server, I just don't think that your current logging setup is causing it.

Posted by MasterGee79, 03-13-2007, 03:57 PM
I just wanted to add that I have done what you are suggesting with caching the pages, and it works very well. I do this instead of using a vBAdvanced-like setup for my vBulletin forums. I use a crontab to update recent posts/news/whatever every 5 minutes, but you can do even better by having the script check the date the file was modified and generate the cache as needed when the page is loaded (when the visitor loads the page, rewrite the cache file if it hasn't been modified in X minutes). And... why not just use AWStats yourself and let them look at the stats they want to? It only updates once every 24 hours, and is not a strain on the server considering you will be the only one viewing the stats.

Posted by HostRefugee-Vince, 03-13-2007, 04:11 PM
There are no problems with the server itself... I just personally find it wasteful to use both Apache logging and MySQL logging. My last proposed solution obviously has some flaws in regards to writing to an extra file... I agree that the INSERT statements to MySQL shouldn't be any more costly than writing to that file. Maybe I should just turn off Apache logging for the domain altogether, given that I rarely check the stats with AWStats anyway. Creating some sort of page caching does seem like a good idea, though. The content rarely changes and some of the SELECTs pull quite a bit of data. The site itself doesn't warrant its own server just yet, but it is expected to grow in terms of daily visits. Honestly, I just want to optimize it in any way possible so it can stay in a shared environment for as long as possible. I figure doing the optimizations now would be better than doing them when it's already becoming problematic for a shared environment. Thanks for everyone's input.


