To follow Site Reliability Engineering (SRE) principles, the recommended approach when introducing a new system, such as an image recognition login system, is to minimize risk by exposing it to a small, controlled group before full deployment. SRE emphasizes progressive rollouts and monitoring to ensure reliability, stability, and security.
Option A, "Roll out the new system to a subset of employees to test it out," is the correct answer because it aligns with the following SRE practices:
Canary Releases: Deploying a new feature to a small group of users (subset of employees) allows the organization to test the system in a real-world environment with minimal risk. This practice helps identify any potential issues, gather feedback, and monitor system behavior without impacting the entire organization.
Gradual Rollouts: SRE encourages gradual rollouts to detect and mitigate any failures early, reducing the blast radius in case of a malfunction or security issue. This ensures the reliability of the system while maintaining service quality.
Monitoring and Observability: Rolling out to a subset allows for comprehensive monitoring and collection of metrics to ensure the system performs as expected. If issues arise, they can be quickly identified and resolved before a wider deployment.
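The subset selection described above can be sketched in Python. This is a minimal illustration, not a prescribed SRE implementation: the stage fractions, the salt string, and the names `ROLLOUT_STAGES`, `bucket`, and `in_rollout` are all assumptions made for the example. The sketch hashes each employee ID to a stable value so that the same employees stay in the canary group as the rollout widens.

```python
import hashlib

# Hypothetical rollout stages: the fraction of employees who get the new
# login system at each step. Real values depend on the organization.
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]


def bucket(employee_id: str, salt: str = "image-login-v1") -> float:
    """Map an employee ID to a stable value in [0, 1).

    The salt is an assumed per-feature string so that different rollouts
    do not always pick the same employees first.
    """
    digest = hashlib.sha256(f"{salt}:{employee_id}".encode("utf-8")).digest()
    # The first 4 bytes give an integer in [0, 2**32), so the result is < 1.
    return int.from_bytes(digest[:4], "big") / 2**32


def in_rollout(employee_id: str, fraction: float) -> bool:
    """Return True if this employee is in the cohort for the given fraction."""
    return bucket(employee_id) < fraction
```

Because an employee's bucket value never changes, anyone included at the 1% stage is still included at 5% and beyond, so widening the rollout only adds users and the blast radius at each stage is bounded by that stage's fraction.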
Option B, "Roll out the new system to all employees to collect as much data as possible," is not advisable under SRE principles because any failure or security flaw would affect every employee at once, leaving no unaffected group to compare against or fall back on.
Option C, "Avoid rolling out the new system because it may have security flaws," and Option D, "Avoid rolling out the new system because it may violate privacy policy," are also incorrect in the context of SRE. While security and privacy are crucial, outright avoidance does not align with SRE practices. Instead, risk is managed through controlled rollouts, testing, and monitoring.
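One way to picture risk being managed rather than avoided is a gate that decides, from canary metrics, whether to widen the rollout. The Python sketch below is illustrative only: the `LoginStats` fields, the `next_action` function, and the thresholds are hypothetical and would be set by the organization's own security and reliability targets.

```python
from dataclasses import dataclass


@dataclass
class LoginStats:
    attempts: int        # total login attempts observed
    failed_logins: int   # legitimate users who were rejected
    false_accepts: int   # wrong person accepted (security signal)


def next_action(canary: LoginStats, baseline: LoginStats,
                max_failure_ratio: float = 1.5,
                min_attempts: int = 500) -> str:
    """Return "rollback", "hold", or "promote" for the canary group.

    Assumed policy: any false accept is a security failure and stops the
    rollout; otherwise the canary's failure rate is compared with the
    existing login system's rate (the baseline).
    """
    if canary.false_accepts > 0:
        return "rollback"
    if canary.attempts < min_attempts:
        return "hold"  # not enough data yet to judge the new system
    canary_rate = canary.failed_logins / max(canary.attempts, 1)
    baseline_rate = baseline.failed_logins / max(baseline.attempts, 1)
    if canary_rate > baseline_rate * max_failure_ratio:
        return "rollback"
    return "promote"
```

Under this assumed policy, a security concern such as the one raised in Option C becomes an explicit rollback condition checked on a small group, instead of a reason never to deploy.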
References:
Google Cloud SRE Workbook: Canarying Releases
Google Cloud SRE Principles: Monitoring, Progressive Rollouts
Google Cloud Architect's Guide: Reliability and Risk Management