Usage Reports longer than 30 days - SharePoint Log Parsing

I recently had a client that wanted metrics on sharepoint usage in their farm.  Of course I already knew about the out-of-box (OOB) usage reporting capabilities of SharePoint 2007.  But the limitations with the OOB reports is that they only show a rolling 30 day picture of usage and show usage for one site collection at a time.  My client wanted to see the metrics for the whole site, a whole year and quarterly reports.  So I went and performed a Google search and found tons of articles on parsing Sharepoint’s log files.  I even built a SSIS solution to parse the log files and place the usage data in a database I created.  I kept thinking to myself, this solution isn’t very elegant and is problematic for implementing in a production environments. 

If you have read any of my other blog entries, you know that I typically go “under the hood” of SharePoint to see what is really happening. So exactly what happens when you turn on Usage Analysis Processing under Operations in Central Administration?  First, SharePoint creates several new database tables in the SSP database. The table names are:

• ANLDay
• ANLHistoricalSiteUsage
• ANLHistoricalWebUsage
• ANLHit
• ANLResource
• ANLUser
• ANLWeb

Then SharePoint starts to create usage log files in the default location C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\12\Logs (unless you changed that location when you started Usage Analysis Processing) with a name format of ex091102.log.  The 091102 portion of the log name changes daily to reflect the date.  Next, depending upon what time frame you entered into Processing Settings section on the Usage Analysis Processing page referenced above, SharePoint will parse through these logs on a daily basis at that time. SharePoint then takes this data and populates the new above mentioned tables in the SSP database. 

Wow that’s great Julian, I’ll just query those tables and start creating my reports!  Well not so fast…  Remember that nasty 30 day rolling picture of usage we discussed earlier?  SharePoint has a stored procedure that will delete table data every day while performing the Usage Analysis Processing job.  The result is that it will delete the data from 31 days ago and add the new data from the day before to keep a rolling 30 days worth of data.  Here are the actual SQL comments inside the stored procedure:

-- Check if the ANLHit table needs to be aggregated into historical data.  We expect to have at most 30 days of daily detailed usage data and at most 335 days of daily summary data. The 30 days starts with "yesterday"'s  (i.e. @TodayDayId-1) and goes back to @TodayDayId-30. The summary data starts at @TodayDayId-365 and goes to @TodayDayId-31.

Bummer…  Now what?  Let’s get further under the hood to make a few modifications to solve this final challenge. 

Disclaimer:

In my solution you will be making a modification to a SharePoint 2007 database.  It is erroneously reported all over the internet that if you make a change to a SharePoint database you will loose all support from Microsoft.  That simply isn’t true – Here is the verbiage straight from Microsoft:

If an unsupported database modification is discovered during a support call, the customer must perform one of the following procedures at a minimum:
• Perform a database restoration from the last known good backup that did not include the database modifications
• Roll back all the database modifications

http://support.microsoft.com/kb/841057

So, the first thing we are going to do is back up the stored procedure named proc_ANL_ProcessShadowTables (located in the SSP database) in safe location just in case we need to “Roll back database modifications”.  Then we are going to modify the default 30 days in this stored procedure to whatever date range value you need for your reporting. 

Open SQL Server Management Studio and drill down to your SSP database > Programmability >Stored Procedures.  Find the stored procedure named proc_ANL_ProcessShadowTables and right click it, then select Modify.  You will need to scroll down to lines I have highlighted in green below (lines 46 and 368) and replace the 30 with a numerical value equal to the amount time you want SharePoint to store the parsed log data.  In the example below, I have replaced the 30 with 90.  In this example, SharePoint will not delete any data until we reach that 90 day parameter. Once you have made the changes, hit F5 or mouse click Execute in the SQL Server Management Studio toolbar.

You are all set, just start writing your queries for reporting straight from the SSP ANL tables!

Modifications to proc_ANL_ProcessShadowTables

set ANSI_NULLS ON
set QUOTED_IDENTIFIER ON
go


-------------------------------------------------------------------------------
-- Usage data processing stored procedures
-------------------------------------------------------------------------------

--
-- Imports the data from the shadow tables into the master tables.
--
-- Both the Shadow and Master locks must have been acquired before calling 
-- this procedure in order to guarantee that this procedure and the reporting 
-- procedures will behave predictably and reliably.
--
ALTER PROCEDURE [dbo].[proc_ANL_ProcessShadowTables] 
 @TodayDayId INT
AS

-- 0. Check if any work actually needs to be done before we start the heavy
-- lifting. Even though it is safe to process usage data multiple times per 
-- day, it can be very taxing to the database server.

-- The most common need for processing is newly-imported usage data. Check for
-- that first.

DECLARE @Hits INT, @Resources INT
SELECT @Hits = COUNT(*) FROM ANLShadowHit
SELECT @Resources = COUNT(*) FROM ANLShadowResource

IF @Hits + @Resources = 0
BEGIN
-- Check if the ANLHit table needs to be aggregated into historical data. 
-- We expect to have at most 30 days of daily detailed usage data and at 
-- most 335 days of daily summary data. The 30 days starts with "yesterday"'s 
-- data (i.e. @TodayDayId-1) and goes back to @TodayDayId-30. The summary data 
-- starts at @TodayDayId-365 and goes to @TodayDayId-31.
--
-- Check if there is any old data in ANLHit (the detailed data) which needs to 
-- be aggregated into the historical tables.
 DECLARE @MinDayId INT
 SELECT @MinDayId = MIN(DayId) FROM ANLHit

-- CHANGED 30 TO 90 HERE
 IF @MinDayId >= (@TodayDayId-90)
 BEGIN
  -- Next, check if old data in the historical tables need to be deleted. We
  -- only need to inspect ANLHistoricalWebUsage because ANLHistoricalSiteUsage 
  -- won't have anything older.
  SELECT @MinDayId = MIN(DayId) FROM ANLHistoricalWebUsage
  IF @MinDayId >= (@TodayDayId-365)
  BEGIN
   -- Lastly, check if ANLDay needs to be updated. If we got here then we know
   -- we have no usage data for 'yesterday' but may still need to create a new
   -- row for 'yesterday' in ANLDay so that we see it with 0 hits in our 
   -- reports. Similarily, we know we don't have any historical usage data 
   -- older than Today-365, but we may have a row in ANLDay for it which needs
   -- to be deleted.

   -- Note: at this point we have the option of very cheaply removing or adding
   -- the appropriate rows in ANLDay. Consider this an optimization for sites
   -- that have a very uneven usage distribution and may receive no hits at all
   -- on some days but over the year accumulate a large amount of usage data.
   IF EXISTS (SELECT * FROM ANLDay WHERE DayId = (@TodayDayId-1)) AND NOT EXISTS (SELECT * FROM ANLDay WHERE DayId < (@TodayDayId-365))
   BEGIN
    -- Nothing to do; all usage data is up to date.
    RETURN -- EARLY RETURN; 
   END
  END
 END
END

-- Take out table-level locks on all analytics tables to avoid row-level locking.
SELECT * FROM ANLDay WITH (TABLOCK, HOLDLOCK) WHERE 1=0
SELECT * FROM ANLHistoricalSiteUsage WITH (TABLOCK, HOLDLOCK) WHERE 1=0
SELECT * FROM ANLHistoricalWebUsage WITH (TABLOCK, HOLDLOCK) WHERE 1=0
SELECT * FROM ANLHit WITH (TABLOCK, HOLDLOCK) WHERE 1=0
SELECT * FROM ANLResource WITH (TABLOCK, HOLDLOCK) WHERE 1=0
SELECT * FROM ANLWeb WITH (TABLOCK, HOLDLOCK) WHERE 1=0
SELECT * FROM ANLShadowHit WITH (TABLOCK, HOLDLOCK) WHERE 1=0
SELECT * FROM ANLShadowResource WITH (TABLOCK, HOLDLOCK) WHERE 1=0
SELECT * FROM ANLShadowWeb WITH (TABLOCK, HOLDLOCK) WHERE 1=0
SELECT * FROM ANLShadowUser WITH (TABLOCK, HOLDLOCK) WHERE 1=0
SELECT * FROM ANLUser WITH (TABLOCK, HOLDLOCK) WHERE 1=0

-- 1. Reconfigure the indexes and drop indexed views to optimize for bulk 
-- inserts and updates.

-- 2. Import users, webs and resources. WSS makes it impossible to track rename 
-- and move operations on resources so when that happens we'll just treat
-- it as a new resource.

-- 2a. Import ANLShadowUser into ANLUser

-- Import all new users from ANLShadowUser
INSERT INTO ANLUser (UserName)
SELECT DISTINCT UserName FROM ANLShadowUser
    WHERE NOT EXISTS (SELECT * FROM ANLUser u WHERE u.UserName = ANLShadowUser.UserName)

-- Update ANLShadowHit to point to the correct ANLUser record 
UPDATE ANLShadowHit
SET UserId = u.UserId 
FROM ANLUser u, ANLShadowUser su
WHERE su.UserId = ShadowUserId AND u.UserName = su.UserName

-- Clear out ANLShadowUser
TRUNCATE TABLE ANLShadowUser

-- 2b. Import ANLShadowWeb into ANLWeb

-- There is a remote chance that we'll have two different WebUrls for the same 
-- WebGuid. This could happen if a Web is renamed, and will only occur in the 
-- logs for that one day. To mitigate this we'll just drop an arbitrary 
-- duplicate Web and fix the broken URL on the next day's import.

-- Delete the duplicates (very rare)
DELETE FROM ANLShadowWeb 
WHERE EXISTS 
    (SELECT * FROM ANLShadowWeb sw2 WHERE sw2.WebGuid = ANLShadowWeb.WebGuid AND sw2.WebUrl < ANLShadowWeb.WebUrl)
OPTION (MAXDOP 4)

-- Update the names of any renamed webs
UPDATE ANLWeb 
SET WebUrl = sw.WebURL
FROM ANLShadowWeb sw
WHERE ANLWeb.WebGuid = sw.WebGuid

-- Import all new webs from ANLShadowWeb
INSERT INTO ANLWeb (WebGuid, WebUrl)
SELECT DISTINCT WebGuid, WebUrl FROM ANLShadowWeb
    WHERE NOT WebGuid = '00000000-0000-0000-0000-000000000000'
    AND NOT EXISTS (SELECT * FROM ANLWeb w WHERE w.WebGuid = ANLShadowWeb.WebGuid)

-- Clear out ANLShadowUser
TRUNCATE TABLE ANLShadowWeb

-- 2c. Import ANLShadowResource into ANLResource

-- Update any non-resolved resources in ANLResource with resolved versions from
-- ANLShadowResource.
UPDATE ANLResource
SET WebAppGuid = s.WebAppGuid, SiteGuid = s.SiteGuid, WebGuid = s.WebGuid, DocName = s.DocName
FROM ANLShadowResource s
WHERE ANLResource.WebAppGuid = '00000000-0000-0000-0000-000000000000' AND NOT s.WebAppGuid = '00000000-0000-0000-0000-000000000000'
AND ANLResource.FullUrl = s.FullUrl

-- Insert any WSS resources which exist in ANLShadowResource but not ANLResource.
INSERT INTO ANLResource
SELECT DISTINCT WebAppGuid, SiteGuid, WebGuid, DocName, FullUrl, HostDns FROM ANLShadowResource a
WHERE NOT EXISTS (SELECT * FROM ANLResource b WHERE a.FullUrl = b.FullUrl)

-- Try to resolve URLs which have no WSS reference info (such as webguid, etc)
-- by looking them up in ANLResource. 

-- Update ANLShadowHit to point to the correct ANLResource records. 

UPDATE ANLShadowHit 
SET ResourceId = r.ResourceId 
FROM ANLResource r, ANLShadowResource sr
WHERE ShadowResourceId = sr.ResourceId 
AND r.FullUrl = sr.FullUrl

-- Fix referrers. We're using 0 to mark null referrers, and we know that no 
-- resources have that ID so those ShadowReferrerResourceIds will become
-- null ReferrerResourceIds.

-- Update referring resources which we know are internal (ones where WebAppGuid 
-- is not null).
UPDATE ANLShadowHit 
SET ReferrerResourceId = r.ResourceId 
FROM ANLResource r, ANLShadowResource sr
WHERE ShadowReferrerResourceId = sr.ResourceId 
AND NOT sr.WebAppGuid = '00000000-0000-0000-0000-000000000000' 
AND r.FullUrl = sr.FullUrl

-- Update resources which we suspect to be external but may just be unresolved
-- internals (ones where WebAppGuid is null).
UPDATE ANLShadowHit 
SET ReferrerResourceId = r.ResourceId 
FROM ANLResource r, ANLShadowResource sr
WHERE ShadowReferrerResourceId = sr.ResourceId 
AND sr.WebAppGuid = '00000000-0000-0000-0000-000000000000' 
AND r.FullUrl = sr.FullUrl

-- Clear out ANLShadowResource
TRUNCATE TABLE ANLShadowResource

-- 3. Summarize and archive hits older than 30 days into the Historical tables.

-- #TempSummary is used as a temporary bucket of stats which makes it a little 
-- easier to store the aggregations before putting them into the Historical 
-- tables. SummaryGuid can refer to WebGuid or SiteGuid, depending on what type 
-- of object we're summarizing.
CREATE TABLE #TempSummary (SummaryDayId int, SummaryGuid uniqueidentifier, Value int, CONSTRAINT PK_WebHitsSummary PRIMARY KEY (SummaryDayId, SummaryGuid))

-- ANLHistoricalWebUsage.Hits
insert into #TempSummary (SummaryDayId, SummaryGuid, Value) 
select DayId, WebGuid, count_big(*) 
from ANLShadowHit, ANLResource 
where ANLShadowHit.ResourceId = ANLResource.ResourceId 
group by DayId, WebGuid

-- Update the the existing records, then insert any missing ones.

update ANLHistoricalWebUsage 
set Hits = Hits + Value 
from #TempSummary
where SummaryGuid = WebGuid and SummaryDayId = DayId

-- The following INSERT statement will seed the other 3 aggregations 
-- (UniqueUsers, HomePageHits, HomePageUniqueUsers) with 0 values so that they
-- do not need to be inserted anymore, just updated.

INSERT INTO ANLHistoricalWebUsage (DayId, WebGuid, Hits, UniqueUsers, HomePageHits, HomePageUniqueUsers) 
SELECT SummaryDayId, SummaryGuid, Value, 0, 0, 0 from #TempSummary WHERE NOT EXISTS (
	SELECT * FROM ANLHistoricalWebUsage WHERE SummaryGuid = WebGuid AND SummaryDayId = DayId)

TRUNCATE TABLE #TempSummary

-- ANLHistoricalWebUsage.UniqueUsers
insert into #TempSummary (SummaryDayId, SummaryGuid, Value) 
select DayId, WebGuid, count_big(distinct UserId) 
from ANLShadowHit, ANLResource
where ANLShadowHit.ResourceId = ANLResource.ResourceId
group by DayId, WebGuid

update ANLHistoricalWebUsage
set UniqueUsers = UniqueUsers + Value 
from #TempSummary
where SummaryGuid = WebGuid and SummaryDayId = DayId

truncate table #TempSummary

-- ANLHistoricalWebUsage.HomePageHits
insert into #TempSummary (SummaryDayId, SummaryGuid, Value) 
select DayId, WebGuid, count_big(*) 
from ANLShadowHit, ANLResource
where ANLShadowHit.ResourceId = ANLResource.ResourceId AND (DocName = N'' OR DocName = NULL OR LOWER(DocName) = N'default.aspx' OR RIGHT(LOWER(DocName),13) = N'/default.aspx')
group by DayId, WebGuid 

update ANLHistoricalWebUsage
set HomePageHits = HomePageHits + Value 
from #TempSummary
where SummaryGuid = WebGuid and SummaryDayId = DayId

truncate table #TempSummary

-- ANLHistoricalWebUsage.HomePageUniqueUsers
insert into #TempSummary (SummaryDayId, SummaryGuid, Value) 
select DayId, WebGuid, count_big(distinct UserId) 
from ANLShadowHit, ANLResource 
where ANLShadowHit.ResourceId = ANLResource.ResourceId AND (DocName = N'' OR DocName = NULL OR LOWER(DocName) = N'default.aspx' OR RIGHT(LOWER(DocName),13) = N'/default.aspx')
group by DayId, WebGuid

update ANLHistoricalWebUsage
set HomePageUniqueUsers = HomePageUniqueUsers + Value 
from #TempSummary
where SummaryGuid = WebGuid and SummaryDayId = DayId

truncate table #TempSummary

DELETE FROM ANLHistoricalWebUsage WHERE DayId < (@TodayDayId - 365)
OPTION (MAXDOP 4)

-- ANLHistoricalSiteUsage.Hits

INSERT INTO #TempSummary (SummaryDayId, SummaryGuid, Value) 
SELECT DayId, SiteGuid, COUNT_BIG(*) 
FROM ANLShadowHit, ANLResource 
WHERE ANLShadowHit.ResourceId = ANLResource.ResourceId
GROUP BY DayId, SiteGuid

UPDATE ANLHistoricalSiteUsage
SET Hits = Hits + Value 
FROM #TempSummary
WHERE SummaryGuid = SiteGuid AND SummaryDayId = DayId

-- The following INSERT statement will seed the other aggregation (UniqueUsers) 
-- with a 0 values so that it does not need to be inserted anymore, just 
-- updated.

INSERT INTO ANLHistoricalSiteUsage (DayId, SiteGuid, Hits, UniqueUsers) 
SELECT SummaryDayId, SummaryGuid, Value, 0 FROM #TempSummary 
WHERE NOT EXISTS (SELECT * FROM ANLHistoricalSiteUsage WHERE SummaryGuid = SiteGuid AND SummaryDayId = DayId)

TRUNCATE TABLE #TempSummary

-- ANLHistoricalSiteUsage.UniqueUsers

INSERT INTO #TempSummary (SummaryDayId, SummaryGuid, Value) 
SELECT DayId, SiteGuid, COUNT_BIG(DISTINCT UserId) 
FROM ANLShadowHit, ANLResource
WHERE ANLShadowHit.ResourceId = ANLResource.ResourceId
GROUP BY DayId, SiteGuid 

UPDATE ANLHistoricalSiteUsage
SET UniqueUsers = UniqueUsers + Value 
FROM #TempSummary
WHERE SummaryGuid = SiteGuid AND SummaryDayId = DayId

TRUNCATE TABLE #TempSummary

DELETE FROM ANLHistoricalSiteUsage WHERE DayId < (@TodayDayId - 365)
OPTION (MAXDOP 4)

-- Note that the most recent hits in ANLHit are from yesterday, hence " < (today-30)" and not " <= (today-30)"
-- CHANGED 30 TO 90 HERE
DELETE FROM ANLHit WHERE DayId < (@TodayDayId - 90) 
OPTION (MAXDOP 4)

-- 4. Import ANLShadowHit

-- Fix the unknown referrers in ANLShadowHit. Do this in two passes: first try 
-- finding fully-resolved resources in ANLShadowResource, then go to 
-- ANLResource for anything left unresolved.
--
-- NOTE: we are not enforcing uniqueness on dbo.ANLHit.FullUrl so weirdness may 
-- 	occur with really long URLs that are truncated and result in 
--	duplicates.

-- Copy the shadow hits into the master table.
INSERT INTO ANLHit (DayId, ResourceId, UserId, ReferrerResourceId) 
SELECT DayId, ResourceId, UserId, ReferrerResourceId FROM ANLShadowHit

-- We're now done with ANLShadowHit
TRUNCATE TABLE ANLShadowHit

-- 5. Delete any resources that are no longer referenced by any hits. Also 
-- check that the summary tables don't use the resources. Note that we're not
-- checking against ANLHistoricalSiteUsage because ANLHistoricalWebUsage is 
-- a superset.

-- Note that we're keeping just the home page of webs that have historical 
-- usage data.
DELETE FROM ANLResource
WHERE NOT EXISTS (SELECT * FROM ANLHit h WHERE h.ResourceId = ResourceId) 
AND NOT (DocName = N'' OR DocName = NULL OR LOWER(DocName) = N'default.aspx' OR RIGHT(LOWER(DocName),13) = N'/default.aspx') AND EXISTS (SELECT * FROM ANLHistoricalWebUsage s WHERE s.WebGuid = WebGuid)
OPTION (MAXDOP 4)

-- Clean up the ANLWeb table. 
-- We'll only drop webs that haven't been accessed in the last year (no historical data).
DELETE FROM ANLWeb 
WHERE NOT EXISTS (SELECT * FROM ANLWeb INNER JOIN ANLHistoricalWebUsage ON ANLWeb.WebGuid = ANLHistoricalWebUsage.WebGuid) 
OPTION (MAXDOP 4)

-- 6. Update ANLDay

-- Create any missing ANLDay entries by inserting all days into ANLDay between 
-- the oldest day day in ANLHistoricalWebUsage and yesterday. 
DECLARE @newDayId INT
SELECT @newDayId = ISNULL(MIN(DayId), @TodayDayId-1) FROM ANLHistoricalSiteUsage
WHILE @newDayId < @TodayDayId
BEGIN
 IF (SELECT COUNT(*) FROM ANLDay WHERE DayId = @newDayId) = 0
 BEGIN
  INSERT INTO ANLDay (DayId, FullDate, IsHistorical) VALUES (@newDayId, DATEADD(DAY,@newDayId,'20000101'), 0)
 END
 SELECT @newDayId = @newDayId + 1
END

-- Delete days older than 365 days from today
DELETE FROM ANLDay WHERE DayId < (@TodayDayId - 365)
OPTION (MAXDOP 4)

-- Update the IsHistorical flag to 1 on all DayIds older than 30 days
-- CHANGED 30 TO 90 HERE
UPDATE ANLDay SET IsHistorical = CASE WHEN DayId < @TodayDayId - 90 THEN 1 ELSE 0 END


Permalink | Comments (0) | Post RSSRSS comment feed

Comments