Check For Non Running Distribution Agents

Today I am going to share with you a SQL Script that can help monitor the status on your distribution agents in a snapshot/transactional replication topology.

Despite their being some built-in monitoring for replication agents in the Replication Monitor, one that seems to be missing is checking for the agents updating the distributor. Sure there’s ones for agents shutting down for whatever reason, but sometimes latency can be reported as “excellent” for a subscriber despite not having updated the distributor for some time. And because it has not updated the distributor, there’s no calcs for transactional latency, which means that whilst on the surface everything is fine, you are slowly heading towards a real headache with replication.

And, despite the SQL Replication Agent being set to RETRY for 4085 Years, occasionally we find that agents either succeed successfully and stop, of just stop after failures. This in turn affects the cleanup job, and this in turn affects replication to subscribers, and so on until replication becomes a real headache to fix. So to alleviate this, below is a job that will check for any agents that have not updated the distributor_history table in the distributor database for the past 15 minutes. It will then send an email out of all agents not updated, and also log a message and all affected agents to the event viewer with a custom error number. Add this script to a job and run every 15 minutes. You may want to fine-tune the frequency; eg you could run the job every 15 minutes but check for any agents that have not updated in an hour.



	-- Email settings
	DECLARE @profile_name VARCHAR(100) = '_MailProfile'
	DECLARE @strsubject VARCHAR(100)
	DECLARE @tableHTML NVARCHAR(max)
	DECLARE @EmailRecipients VARCHAR(max) = 'richard.lee@phoebix.com'

	CREATE TABLE #agentHistory (
		ServerName NVARCHAR (64)
		,Publication NVARCHAR (64)
		,AgentName NVARCHAR(512)
		,LastMsgTime DATETIME
		)

;WITH CTE ( server_name,publication,agent_name,last_update) AS
( 
SELECT
		srvr.srvname
		,agent.publication 
		,agent.NAME
		,MAX(history.TIME)
	FROM MSdistribution_agents agent
	INNER JOIN msdistribution_history history ON history.agent_id = agent.id
	INNER JOIN sys.sysservers srvr on srvr.srvid = agent.subscriber_id
	GROUP BY srvr.srvname,agent.publication,agent.NAME
	)
	INSERT INTO #agentHistory
	SELECT server_name,publication,agent_name,last_update from CTE
	WHERE last_update < dateadd(MINUTE, - 30, getdate())

	IF EXISTS (
			SELECT *
			FROM #agentHistory
			)
	BEGIN
		SELECT @strsubject = 'Replication Agent Alert ' + convert(VARCHAR(17), getdate(), 113) + ' ***'

		SET @tableHTML = N'<H1>Replication Agent has not Logged a Message</H1>' + N'<h3>One or more agents have not logged an update in the past hour</h3>' + N'<table border="1">' + N'<tr><th>Server Name</th>'+ N'<th>Publication</th>'+ N'<th>Agent Name</th>' +  N'<th>Last Updated Time</th>' + '</tr>' + cast((
					SELECT td = AH.ServerName
						,''
						,td = AH.Publication
						,''
						,td = AH.AgentName
						,''
						,td = AH.LastMsgTime
						,''
					FROM #agentHistory AH
					FOR XML path('tr')
						,type
					) AS NVARCHAR(max)) + N'</table>' + CHAR (10) + CHAR (13) +

					'<p> In order to resolve this, log onto the server, start up SQL Server and find the job under "SQL Server Agent -> Jobs". If job is stopped, right click and select "start job at step". The job will start running. Do not wait for it to complete as the job runs continuously. If the same jobs alerts again raise the issue with the DBA Team. '

		EXEC msdb.dbo.sp_send_dbmail @recipients = @EmailRecipients
			,@subject = @strsubject
			,@body = @tableHTML
			,@body_format = 'HTML'
			,@profile_name = @profile_name
	END

DECLARE @COUNT INT
DECLARE @PostLog BIT = 0
DECLARE @@SERVER_NAME VARCHAR (128)
DECLARE @@AGENTNAME VARCHAR (128) = ''
DECLARE @@PUBLICATION VARCHAR (128) = ''

DECLARE @@MESSAGE varchar(2000)
DECLARE @@MSG_ARRAY varchar (2000) = ''

SELECT @COUNT = COUNT (*) FROM #agentHistory
SELECT @PostLog = CASE WHEN @COUNT > 0 THEN 1 ELSE 0 END
WHILE @COUNT > 0
BEGIN

SELECT @@AGENTNAME = MAX (Ah.AgentName) FROM #agentHistory AH

SELECT @@MESSAGE = 'Agents have not logged a message. Log on to server, open SQL Server Management Studio and start job if not running. If job continues to alert that it has stopped then contact DBA Team.
If job is running then monitor and close ticket when agents no longer alert.'
SELECT @@MSG_ARRAY =  @@AGENTNAME + ' '  + CHAR(10) + CHAR(13) + @@MSG_ARRAY + ' ' 

SET @COUNT = @COUNT - 1

DELETE FROM #agentHistory where AgentName = @@AGENTNAME

SELECT @@MESSAGE = @@MESSAGE + CHAR(10) + CHAR(13) +@@MSG_ARRAY 

END

IF @PostLog = 1

BEGIN

USE master
EXEC xp_logevent 68320, @@MESSAGE, error

END

DROP TABLE #agentHistory

Author: Richie Lee

Full time computer guy, part time runner. Full time Dad, part time blogger. Pokémon Nut. Writer of fractured sentences. Maker of the best damn macaroni cheese you've ever tasted.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s