I spent some time working with the Update Management feature of Azure Automation accounts and noticed a few issues with the example scripts provided by Microsoft, so I figured I would share my script with you here.
Today we use SCCM to automate the majority of our Windows server patching, but I was interested in seeing how well the pre/post script functionality in Update Management worked.
Pre/Post Script Overview
The pre and post scripts run at the deployment level. A special property, SoftwareUpdateConfigurationRunContext, is provided to the runbook when it is initiated by the update deployment. This property contains all of the configuration settings for the deployment, including a list of both the Azure virtual machines and the non-Azure computers targeted by the deployment.
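For reference, here is a trimmed-down sketch of what that run context payload can look like and how a runbook would parse it. The property names below reflect my understanding of the payload; the real document contains the full deployment configuration, and the values shown are placeholders.

```powershell
# Hypothetical, trimmed-down example of the run context JSON a deployment
# passes to the runbook. Values are placeholders, not real identifiers.
$SoftwareUpdateConfigurationRunContext = @'
{
    "SoftwareUpdateConfigurationName": "Monthly-Server-Patching",
    "SoftwareUpdateConfigurationRunId": "00000000-0000-0000-0000-000000000000",
    "SoftwareUpdateConfigurationSettings": {
        "operatingSystem": "Windows",
        "azureVirtualMachines": [],
        "nonAzureComputerNames": [
            "server01.contoso.local",
            "server02.contoso.local"
        ]
    }
}
'@

# Parsing it the same way the runbook does:
$context = ConvertFrom-Json $SoftwareUpdateConfigurationRunContext
$context.SoftwareUpdateConfigurationSettings.nonAzureComputerNames
```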
If you wanted to validate the state of the servers targeted by a deployment, you could retrieve the target devices from the run context property and initiate a child runbook for each device in the list. Throwing a terminating error in the pre-script will cause it to fail, and the deployment will be stopped.
Microsoft provides an example script for initiating child runbooks on a hybrid worker, but their script has several issues that need to be addressed. I am going to call out the parts of their script I had issues with and then share my updated version.
The Hybrid Worker Groups Parameter
They declared a parameter named “HybridWorkerGroups” as a string.
param(
    [parameter(Mandatory=$true)] [string]$RunbookName,
    [parameter(Mandatory=$true)] [string]$HybridWorkerGroups,
    [string]$SoftwareUpdateConfigurationRunContext
)
Later they try to loop through the members of that string to start the child runbook. Because $HybridWorkerGroups is a single string rather than an array, the foreach loop runs exactly once with the entire string as the value.
#Start script on each machine
foreach($machine in $HybridWorkerGroups)
{
    $output = Start-AzureRmAutomationRunbook -Name $RunbookName -ResourceGroupName $ResourceGroup -AutomationAccountName $AutomationAccount -RunOn $machine
    $runStatus.Add($output)
}
What I believe they intended to do was extract the list of servers from the run context property and loop through that list, something more like this:
$context = ConvertFrom-Json $SoftwareUpdateConfigurationRunContext
$machines = $context.SoftwareUpdateConfigurationSettings.nonAzureComputerNames | Sort-Object -Unique

#Start script on each machine
foreach($machine in $machines)
{
    $output = Start-AzureRmAutomationRunbook -Name $RunbookName -ResourceGroupName $ResourceGroup -AutomationAccountName $AutomationAccount -RunOn $machine
    $runStatus.Add($output)
}
While those changes would work, they assume every server is set up as its own hybrid runbook worker and is the sole member of a worker group named after the server's FQDN. In my environment I use a single hybrid worker group and interact with the servers in my deployment remotely. That gives me the ability to restart servers as part of my pre/post scripts and validate that they came back online. My approach may not be appropriate for all situations, though.
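As an illustration of that remote approach, here is a hedged sketch of a child runbook that restarts a target server from the hybrid worker and validates it came back online. This is a hypothetical example, not my published runbook; it assumes PowerShell remoting (WinRM) is enabled from the hybrid worker to the target server and that the worker's credentials have access.

```powershell
# Hypothetical child runbook: restart a server and confirm it comes back.
# Assumes WinRM is enabled between the hybrid worker and the target.
param(
    [parameter(Mandatory=$true)] [string]$ComputerName
)

# -Wait blocks until the machine is reachable again; -For WinRM waits
# specifically for remoting to be available, and -Timeout caps the wait.
Restart-Computer -ComputerName $ComputerName -Force -Wait -For WinRM -Timeout 600

# Validate the server actually came back by running a trivial remote command.
$osInfo = Invoke-Command -ComputerName $ComputerName -ScriptBlock {
    Get-CimInstance -ClassName Win32_OperatingSystem |
        Select-Object CSName, LastBootUpTime
}

if (-not $osInfo)
{
    # A terminating error fails this child job, which the parent runbook
    # detects and uses to halt the update deployment.
    throw "Unable to validate $ComputerName after restart"
}
```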
Retrieving The Automation Account Details
To get the name of the resource group and Automation Account required for the Start-AzureRmAutomationRunbook command, they loop through all of the Automation Accounts configured in the subscription looking for the currently running job. They use the $PSPrivateMetadata variable that is created as part of each new job to retrieve the job ID GUID.
$AutomationResource = Get-AzureRmResource -ResourceType Microsoft.Automation/AutomationAccounts

foreach ($Automation in $AutomationResource)
{
    $Job = Get-AzureRmAutomationJob -ResourceGroupName $Automation.ResourceGroupName -AutomationAccountName $Automation.Name -Id $PSPrivateMetadata.JobId.Guid -ErrorAction SilentlyContinue
    if (!([string]::IsNullOrEmpty($Job)))
    {
        $ResourceGroup = $Job.ResourceGroupName
        $AutomationAccount = $Job.AutomationAccountName
        break
    }
}
There isn’t necessarily anything wrong with this approach, but I opted to use Automation Account variables to store those details, along with the name of the centralized hybrid worker group I want the scripts to run from. In my case I was already going to need to retrieve the hybrid worker group name, so I felt it simplified the script.
$ResourceGroup = Get-AutomationVariable -Name 'UpdateManagement-ResourceGroup'
$AutomationAccount = Get-AutomationVariable -Name 'UpdateManagement-AutomationAccount'
$RunOn = Get-AutomationVariable -Name 'UpdateManagement-HybridWorker'
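If you want to reproduce this setup, those variables can be created once up front. The sketch below is a one-time setup step, not part of the runbook; the variable names match the Get-AutomationVariable calls above, while the resource group, account, and worker group values are placeholders for your own environment.

```powershell
# One-time setup, run from a machine with the AzureRM modules loaded.
# Resource group, account, and worker group names below are placeholders.
$rg = 'MyResourceGroup'
$aa = 'MyAutomationAccount'

New-AzureRmAutomationVariable -ResourceGroupName $rg -AutomationAccountName $aa `
    -Name 'UpdateManagement-ResourceGroup' -Value $rg -Encrypted $false
New-AzureRmAutomationVariable -ResourceGroupName $rg -AutomationAccountName $aa `
    -Name 'UpdateManagement-AutomationAccount' -Value $aa -Encrypted $false
New-AzureRmAutomationVariable -ResourceGroupName $rg -AutomationAccountName $aa `
    -Name 'UpdateManagement-HybridWorker' -Value 'MyHybridWorkerGroup' -Encrypted $false
```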
Tracking Child Runbook Status
After initiating the child runbook for each of the machines in the deployment, they loop through the job list waiting for the status of each job to transition to Completed.
foreach($job in $runStatus)
{
    #First, wait for each job to complete
    $currentStatus = Get-AzureRmAutomationJob -Id $job.jobid -ResourceGroupName $ResourceGroup -AutomationAccountName $AutomationAccount
    while ($currentStatus.status -ne "Completed")
    {
        Start-Sleep -Seconds 5
        $currentStatus = Get-AzureRmAutomationJob -Id $job.jobid -ResourceGroupName $ResourceGroup -AutomationAccountName $AutomationAccount
    }

    #Then, store the summary
    $summary = Get-AzureRmAutomationJobOutput -Id $job.jobid -ResourceGroupName $ResourceGroup -AutomationAccountName $AutomationAccount
    $finalStatus.Add($summary)
}
Then they process the results of each completed job to determine if an exception needs to be thrown to halt the deployment.
foreach($summary in $finalStatus)
{
    if ($summary.Type -eq "Error")
    {
        #We must throw in order to fail the patch deployment.
        throw $summary.Summary
    }
}
The problem here is that they are only looking for the status to be “Completed”, but there are three other terminal statuses the child runbook could end in — Failed, Stopped, or Suspended — any of which would leave your runbook stuck in a loop checking the status until it times out on the Azure side.
What I have done is use a function to check whether the job has reached a terminal status, and also set up a timeout so I can fail a job that gets hung or runs for longer than expected.
function IsJobTerminalState([string] $status)
{
return $status -eq "Completed" -or $status -eq "Failed" -or $status -eq "Stopped" -or $status -eq "Suspended"
}
As I process all of the child runbook jobs I call the function to determine if the job is still running.
foreach($RunningJob in $runStatus)
{
    $currentStatus = $RunningJob | Get-AzureRmAutomationJob
    $pollingSeconds = 15
    $maxTimeout = 1200
    $waitTime = 0

    # Wait until job is no longer running
    while((IsJobTerminalState $currentStatus.Status) -eq $false -and $waitTime -lt $maxTimeout)
    {
        Start-Sleep -Seconds $pollingSeconds
        $waitTime += $pollingSeconds
        $currentStatus = $RunningJob | Get-AzureRmAutomationJob
    }

    # Store job status to evaluate later
    $finalStatus.Add($currentStatus)
}
Once the child runbook has stopped, or the timeout value has been exceeded, I add the job status to the final status list. After all of the child runbooks have finished, I evaluate that list to determine if the deployment should be stopped.
foreach($Job in $finalStatus)
{
    if ($Job.Status -ne "Completed")
    {
        # Write error with job details for reference
        Write-Error -Message ("Job Status: " + $Job.Status + " RunbookName: " + $Job.RunbookName + " HybridWorker: " + $Job.HybridWorker + " JobID: " + $Job.JobId)
        # Throwing an exception will cause the script to go into a failed state, which will cancel the update deployment.
        throw "Halting update management process"
    }
}
That covers the issues I ran into when initially trying to get pre and post scripts running with Update Management and non-Azure machines. So far I have been pretty happy with the Update Management functionality, and I will be working on setting up integration with SCCM next.
I have a copy of my runbook published here. In a future post I will share some examples of pre and post deployment tasks I have set up.