Is Synthetic Monitoring Really Going to Die?
More and more people are talking about the end of synthetic monitoring. It is associated with high costs and a lack of insight into real user performance. This view is supported by the evolving standards of the W3C Web Performance Working Group, which will make it possible to collect more accurate data, with deeper insight, directly from end users' browsers. Will User Experience Management using JavaScript agents eventually replace synthetic monitoring, or will both approaches coexist in the end?
I think it is a good idea to compare these two approaches in a number of categories which I see as important from a performance management perspective. Having worked intensively with both approaches, I will present my personal experience. Some judgments might be subjective – but this is what comments are for.
Real User Perspective
One of the most important requirements of end-user monitoring, if not the most important, is to experience performance exactly as real users do. The question is how close the monitoring results are to what real application users see.
Synthetic monitoring collects measurements using pre-defined scripts executed from a number of locations. How close this is to what users see depends on the actual measurement approach. Only solutions that use real browsers, rather than just emulating them, provide reliable results. Some approaches monitor only from high-speed backbones like Amazon EC2 and merely emulate different connection speeds, which makes their measurements only an approximation of real user performance. Solutions like Gomez Last Mile, in contrast, measure from real user machines spread out across the world, resulting in more precise results.
Agent-based approaches like dynaTrace UEM measure directly in the user’s browser taking actual connection speed and browser behavior into account. Therefore they provide the most accurate metrics on actual user performance.
Transactional Coverage
Transactional coverage defines how many types of business transactions – or application functionality – are covered. The goal of monitoring is to cover 100 percent of all transactions. The minimum requirement is to cover at least all business critical transactions.
For synthetic monitoring this directly relates to the number of transactions which are modeled by scripts. The more scripts, the higher the coverage. This comes at the cost of additional development and maintenance effort.
Agent-based approaches measure using JavaScript code which gets injected into every page automatically. This results in 100 percent transactional coverage. The only content that is not covered by this approach is streaming content as agent-based monitoring relies on JavaScript to be executed.
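To make the coverage idea concrete, here is a minimal sketch of how scripted coverage could be computed. The transaction names and the helper functions are hypothetical and only serve as an illustration; they do not come from any particular monitoring product.

```python
from __future__ import annotations


def transactional_coverage(all_transactions: set[str], scripted: set[str]) -> float:
    """Return the fraction of known transactions exercised by at least one script."""
    if not all_transactions:
        return 0.0
    covered = all_transactions & scripted
    return len(covered) / len(all_transactions)


def uncovered_critical(critical: set[str], scripted: set[str]) -> list[str]:
    """Return business-critical transactions that no script covers, sorted by name."""
    return sorted(critical - scripted)


# Hypothetical transaction inventory (assumption, for illustration only).
ALL_TRANSACTIONS = {"login", "search", "add_to_cart", "checkout", "update_profile"}
CRITICAL = {"login", "checkout"}
SCRIPTED = {"login", "search"}

coverage = transactional_coverage(ALL_TRANSACTIONS, SCRIPTED)
missing = uncovered_critical(CRITICAL, SCRIPTED)
print(f"coverage: {coverage:.0%}, critical transactions without a script: {missing}")
```

With an injected agent the scripted set is effectively the full set of pages that execute JavaScript, which is why the coverage gap only shows up on the synthetic side.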
SLA Monitoring
SLA monitoring is central to ensuring service quality at the technical and business level. For SLA management to be effective, not only internal but also third-party services like ads have to be monitored.
While agent-based approaches provide rich information on end-user performance, they are not well suited for SLA management. Agent-based measurements depend on the user's network speed, local machine etc. This means a very volatile environment. SLA management, however, requires a well-defined and stable environment. Another issue with agent-based approaches is that third parties like CDNs or external content providers are very hard to monitor.
Synthetic monitoring uses pre-defined scripts and provides a stable and predictable environment. The use of real browsers and the resulting deeper diagnostics capabilities enable more fine-grained diagnostics and monitoring, especially for third-party content. Synthetic monitoring can also check SLAs for services which are currently not used by actual users.
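The effect of a volatile measurement environment on an SLA check can be sketched as follows. The response times and the 2000 ms threshold are made-up numbers chosen for illustration, and `sla_compliance` is a hypothetical helper, not the method of any specific tool.

```python
from statistics import pstdev


def sla_compliance(response_times_ms: list[float], threshold_ms: float) -> float:
    """Return the fraction of measurements at or below the SLA threshold."""
    if not response_times_ms:
        raise ValueError("at least one measurement is required")
    within = sum(1 for t in response_times_ms if t <= threshold_ms)
    return within / len(response_times_ms)


# Illustrative numbers only (assumption): the same page measured two ways.
THRESHOLD_MS = 2000
synthetic_ms = [1180, 1210, 1195, 1225, 1200]   # fixed location, fixed bandwidth
real_user_ms = [650, 900, 1400, 3100, 5200]     # varying networks and machines

synthetic_compliance = sla_compliance(synthetic_ms, THRESHOLD_MS)
real_user_compliance = sla_compliance(real_user_ms, THRESHOLD_MS)

print(f"synthetic: {synthetic_compliance:.0%}, spread {pstdev(synthetic_ms):.0f} ms")
print(f"real user: {real_user_compliance:.0%}, spread {pstdev(real_user_ms):.0f} ms")
```

In this sketch the real-user series misses the threshold purely because of client-side variation, which is the reason a stable measurement environment is easier to hold a provider accountable against.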
Availability Monitoring
Availability monitoring is an aspect of SLA monitoring. We look at it separately as availability monitoring comes with some specific technical prerequisites which are very different between agent-based and synthetic monitoring approaches.
For availability monitoring only synthetic script-based approaches can be used. They do not rely on JavaScript code being injected into the page but measure from points of presence instead. This enables them to measure even when a site is down, which is essential for availability monitoring.
Agent-based approaches will not collect any monitoring data if a site is actually down. The only exception is an agent-based solution which also runs inside the web server or proxy, like dynaTrace UEM. Availability problems resulting from application server problems can then be detected based on HTTP response codes.
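A synthetic availability probe in its simplest form could look like the sketch below. It uses only the Python standard library; the three result labels and the rule that only 5xx responses count as server-side failures are assumptions made for this example, not a description of how any vendor classifies results.

```python
from __future__ import annotations

import urllib.error
import urllib.request


def classify_status(status: int | None) -> str:
    """Map an HTTP status code (or None for no response) to an availability label."""
    if status is None:
        return "unreachable"      # DNS failure, refused connection, timeout
    if 500 <= status <= 599:
        return "server_error"     # the site answers, but the application is failing
    return "available"            # assumption: anything below 500 counts as up


def probe(url: str, timeout_s: float = 10.0) -> str:
    """Request the URL once from this machine and classify the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses raise HTTPError but still carry a status code.
        return classify_status(err.code)
    except OSError:
        # URLError and socket timeouts are OSError subclasses: no response at all.
        return classify_status(None)
```

The "unreachable" case is the one an in-page JavaScript agent can never report, because the page that would carry the agent is never delivered.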
Understanding User-Specific Problems
In some cases – especially in a SaaS environment – the actual application functionality heavily depends on user-specific data. In case of functional or performance problems, information on a specific request of a user is required to diagnose the problem.
Synthetic monitoring is limited to the transactions covered by scripts. In most cases they are based on test users rather than real user accounts (you would not want a monitoring system to operate a real banking account). For an eCommerce site where a lot of functionality does not depend on an actual user, synthetic monitoring provides reasonable insight here. For many SaaS applications this however is not the case.
Agent-based approaches are able to monitor every single user click resulting in a better ability to diagnose user specific problems. They also collect metrics for actual user requests instead of synthetic duplicates. This makes them the preferred solution for web sites where functionality heavily depends on the actual user.
Third Party Diagnostics
Monitoring of third party content poses a special challenge. As the resources are not served from our own infrastructure we only have limited monitoring capabilities.
Synthetic monitoring using real browsers provides the best insight here. All the diagnostics capabilities available within browsers can be used to monitor third-party content. In fact, the possibilities for third-party and own content are the same. Besides the actual content, networking or DNS problems can also be diagnosed.
Agent-based approaches have to rely on the capabilities accessible via JavaScript in the browser. While new standards from the W3C Web Performance Working Group will make this easier in the future, it is hard to do in older browsers. It requires a lot of tricks to find out whether third-party content loads and performs well.
Proactive Problem Detection
Proactive problem detection aims to find problems before users do. This not only gives you the ability to react faster but also helps to minimize business impact.
Synthetic monitoring tests functionality continuously in production. This ensures that problems are detected and reported immediately, regardless of whether someone is using the site or not.
Agent-based approaches only collect data when a user actually accesses your site. If, for example, you are experiencing a problem with a CDN from a certain location in the middle of the night when nobody uses your site, you will not see the problem before the first user accesses your site in the morning.
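The difference in detection time can be sketched with a small calculation. The dates, the 15-minute check interval and both helper functions are hypothetical and only illustrate the night-time scenario described above.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def synthetic_detection_time(
    failure_start: datetime, first_check: datetime, interval: timedelta
) -> datetime:
    """Return the first scheduled check that runs at or after the failure starts."""
    if failure_start <= first_check:
        return first_check
    elapsed = failure_start - first_check
    checks_needed = -(-elapsed // interval)  # ceiling division on timedeltas
    return first_check + checks_needed * interval


def agent_detection_time(
    failure_start: datetime, visit_times: list[datetime]
) -> datetime | None:
    """Return the first real user visit at or after the failure, or None if there is none."""
    later_visits = [v for v in visit_times if v >= failure_start]
    return min(later_visits) if later_visits else None


# Hypothetical scenario (assumption): a CDN problem starts at 02:07 at night.
failure = datetime(2024, 1, 1, 2, 7)
synthetic_seen = synthetic_detection_time(
    failure, first_check=datetime(2024, 1, 1, 0, 0), interval=timedelta(minutes=15)
)
agent_seen = agent_detection_time(
    failure, [datetime(2023, 12, 31, 23, 30), datetime(2024, 1, 1, 7, 45)]
)
print(f"synthetic check sees it at {synthetic_seen:%H:%M}, first user at {agent_seen:%H:%M}")
```

The synthetic delay is bounded by the check interval, while the agent-based delay depends entirely on when the next user happens to arrive.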
Maintenance Effort
Cost of ownership is always an important aspect of software operation. So the effort needed to adjust monitoring to changes in the application must be taken into consideration as well.
As synthetic monitoring is script based it is likely that changes to the application require changes to scripts. Depending on the scripting language and the script design the effort will vary. In any case there is continuous manual effort required to keep scripts up-to-date.
Agent-based monitoring, on the other hand, does not require any changes when the application changes. Automatic instrumentation of event handlers etc. ensures zero effort for new functionality. At the same time, modern solutions inject the HTML fragments required to collect performance data into the HTML content automatically at runtime.
Suitability for Application Support
Besides operations and business monitoring, support is the third main user of end user data. In case a customer complains that a web application is not working properly, information on what this user was doing and why it is not working is required.
Synthetic monitoring can help here in case of general functional or performance issues like a slow network from a certain location or broken functionality. It is, however, not possible to get information on what a user was doing exactly and to follow that user's click path.
Agent-based solutions provide much better insight. As they collect data for real user interactions, they have all the information required for understanding potential issues users are experiencing. Even problems experienced by only a single user can be discovered.
Conclusion
Putting all these points together, we can see that both synthetic monitoring and agent-based approaches have their strengths and weaknesses. One cannot simply choose one over the other. This is also validated by the fact that many companies use a combination of both approaches. The same is true for APM vendors, which provide products in both spaces. The advantage of using both approaches is that modern agent-based approaches perfectly compensate for the weaknesses of synthetic monitoring, leading to an ideal solution.



I’m excited about the advances in RUM but I agree with your conclusion that they both have strengths and both will be necessary (though I’d love to see them become more integrated and start to leverage each other – automatically triggering synthetic tests based on RUM data for example).
Does Gomez Last Mile use real browsers? I thought they used an emulator (or were you just referring to that for the connectivity part)?
Hi Patrick: The Gomez Last Mile uses a real Firefox browser on consumer PCs for both monitoring and load testing.
We share your perspective on better integration/automation/etc. between synthetic and real-user monitoring. For example, we have customers that use RUM to inform their synthetic monitoring strategy, synthetic diagnostics to complement RUM-identified issues, etc. We’re excited about the opportunity to bring these different vantage points together…
Hey Alois,
Wow. You managed to find a good way to fit this big topic into a good, quick-reading format.
I would like to stress one point. You said: “Agent-based measurement depend on the user’s networks speed….SLA management however requires a well-defined and stable environment”.
You are right, the real world is volatile. But in my opinion one has to measure a volatile environment, and maybe one has to redefine SLAs.
SLAs often enough just say “this availability” and (if at all) “this time” at this point.
If you publish a web page and expect some business coming from your end users, will they sit at the point of your SLA definition?
I think customers, except for a very few, do not sit directly in “unrealistic” places (like backbones).
Volatility means a high degree of deviation from the standard – tough to define in SLAs, but why not give it a try?
What I would like to stress is: volatility should be reflected in SLAs – otherwise your efforts to optimize performance in highly volatile areas (such as mobile) are not reflected.
And I would like to agree with you. Both measurement types will be required in the future. Not necessarily for SLA reporting, but to get you into a position where you can proactively become aware of problems before customers are affected – including the complete root cause analysis.
@Patrick. I agree that the step forward by the new W3C specs is great and will enable us to build really interesting new monitoring tools.
Actively integrating both makes a lot of sense. Actually I was thinking about the same. Being able to automatically trigger synthetic tests when we see problems at the UEM level is something I am looking into as well.
What we see people do today is that they use UEM for monitoring and isolation and then switch to deep analysis tools like dynaTrace Ajax Premium for example.
@Heiko You make a good point here. I agree that SLAs should reflect real world conditions. So if most of your users are using a cell phone on the way home from work you will have to adjust your SLAs – and even more your application – to reflect this.
The point with basing SLAs on real user response times is that they might go down although your infrastructure is working properly. This would not be an SLA breach but rather a condition that you are not prepared for.
This is why we at dynaTrace are showing User Experience and SLA thresholds.
>> Only solutions that use real browsers and not just emulate provide reliable results.<<
Correct. And on Mobile it becomes even more relevant. It's not enough anymore to be "inside" the browser; you need to integrate the carrier and the geo-location to really see what the user is experiencing.
There's only one solution out there that I know of that does that (packet capture is NOT the same as measuring HTTP traffic perf from inside the browser).
Cheers,
Peter
While agent-based approaches provide rich information on end-user performance, they are not well suited for SLA management. Agent-based measurement depend on the user’s networks speed, local machine etc.
@Garage door remote controls: Depends on where the agents are based. If you use agents based in backbone locations you can absolutely use them for monitoring SLAs. On the other hand – if you want real-life SLAs (for your own performance) you need to monitor where real life happens – at the desktops of your users – where network speed and local machines are less stable. Or the son is playing WoW.