| | |
| ------- | ------- |
| ![Chord latency](https://files.osf.io/v1/resources/rdsb8/providers/osfstorage/5c781f5782a3950019d0e34a?mode=render =90%x) | ![Chord frequency](https://files.osf.io/v1/resources/rdsb8/providers/osfstorage/5c781f7f62c82a0017e36ba0?mode=render =90%x) |
| ![Chord latency](https://files.osf.io/v1/resources/rdsb8/providers/osfstorage/5c781f9d8d5d98001a3ec309?mode=render =90%x) | Chord diagrams showing interactions between different microservices |

# Goals

Distributed systems with fine-grained components are a growing trend in the industry. Microservice-based architectures and Function as a Service (FaaS) platforms are favored for the flexibility they afford. This trend is only accentuated by the financial benefits and reduced development times promised by Platform as a Service (PaaS) and serverless deployments. Well-defined functional scopes bring faster development cycles, team independence, and ease of deployment, management, scaling, and governance. Microservices and FaaS are the building blocks of modern, highly dynamic distributed systems.

However, these advantages come at the cost of more decoupled, fragmented, and complex systems. The complexity moves from the components to their interactions and emergent behaviors. As systems age and more components are added and modified, system complexity outgrows the capacity of human operators: no single operator can understand all components and their interactions. The result is, at best, impaired observability. It therefore becomes necessary to develop tools and techniques that improve observability and automate performance analysis.

While the ultimate goal of our research is to ensure quality of service to the end users of large-scale online applications, this involves a number of subgoals:

- Anomaly detection: which services are responsible for long-tail response times?
- Diagnosing steady-state problems: which services are affecting average response times?
- Workload modeling: assign and optimize resources according to response-time forecasts.

To answer these questions, we have been performing black-box analysis of legacy and complex systems, modeling systems using queueing theory, and building monitoring dashboards.

![Grafana dashboard - dark version](https://files.osf.io/v1/resources/rdsb8/providers/osfstorage/5c781fd88d5d98001a3ec359?mode=render)

![Grafana dashboard - light version](https://files.osf.io/v1/resources/rdsb8/providers/osfstorage/5c78200782a3950018cf15cd?mode=render)

# Black Box Analysis

Measuring the capacity of a real distributed system and its components, and modeling their response to load, requires painstaking instrumentation. Even though it greatly improves observability, instrumentation may not be desirable due to its cost, or even possible due to legacy constraints. We have therefore also been working on the analysis of non-instrumented components and systems. This is extremely useful for operators, who need to ensure responsiveness and plan resource provisioning.

![Experimental results](https://mfr.de-1.osf.io/export?url=https://osf.io/69rcy/?direct%26mode=render%26action=download%26public_file=True&initialWidth=848&childId=mfrIframe&parentTitle=OSF+%7C+mlfd_dots.png&parentUrl=https://osf.io/69rcy/&format=2400x2400.jpeg)

![Experimental results](https://mfr.de-1.osf.io/export?url=https://osf.io/xnh6w/?direct%2526mode=render%26direct%26mode=render%26action=download%26public_file=True&initialWidth=848&childId=mfrIframe&parentTitle=OSF+%7C+ml_dist.png&parentUrl=https://osf.io/xnh6w/?direct%2526mode=render&format=2400x2400.jpeg)