How many times have you had to set up a service in your organisation, only to find there is no information on how to do it? Maybe you found some documentation, but it’s outdated. Often the best option you have is to ask a teammate for help, hoping they have done the same thing before. How many times have you hit an issue in your daily work that required help from the support team? Even then, you have to log into a support tool, fill in multiple fields and describe the issue. And that takes time.
All of this happens just when you are most excited: your team has a new project and you want to jump in! But first you need to set everything up in the development lifecycle tools: create the team, repositories and build pipelines, set up permissions for each team member, and so on. Your enthusiasm wanes, because it all takes hours, if not days, to complete.
Now imagine you have a teammate who can help you 24/7 in all of those cases. It sounds unbelievable, but it is not! DevOps ChatBot is here, at your service. It leads you through a conversation and, by asking questions, discovers what has to be done.
The whole purpose of this project was to use a machine for repeatable, time-consuming tasks that are usually done manually by a support team or developers. The Proof of Concept was built by Igor Fiodorow, who delivered a sample implementation of a service acting like a human being, for example supporting project creation in Azure DevOps. You could just send a message to our bot using a communicator like Skype or Microsoft Teams and it would guide you through the process. Nice, easy, simple.
Unfortunately, the security team blocked usage of Azure Registration (required as a part of the Microsoft Bot ecosystem) within the client environment, so we had to find another way to utilise the potential we had discovered. The goal was to provide a fully working solution on premise, without exposing any sensitive data to the cloud. After a lot of research we discovered that an on-prem solution had not been implemented so far: every example on the Internet needed Azure to work. As a last resort, we decided to replicate the conversation setup done in the bot working with Azure, but using Lync/Skype instead. It worked! The bot responded nicely without any need to communicate with the cloud ecosystem. We received the green light and started work on a complete solution, aiming to go live as soon as possible.
We’ve created multiple components that act as a whole ecosystem. The general flow of communication is as follows:
- The user starts a conversation with the bot through Skype, in the same way as they would with any other person.
- A LYNC Façade Service, the middle layer between the user’s Skype client and the bot, receives the conversation invitation, handles the handshaking, etc. It then creates a scope for the conversation and an initial object describing each action the user performs in Skype (sending a message, uploading a file, etc.). In Microsoft Bot Framework these objects are named “Activity”.
- The prepared Activity is passed to the module that does all the magic: communicating with LUIS.ai (a natural language processing service), handling recognized intents, managing the conversation flow (a decision tree) and working with external services (e.g. Azure DevOps, Octopus Deploy, ServiceNow). All response messages are sent back to the Façade, which forwards them to the conversation with the user. This module builds on Microsoft Bot Framework, software that helps in creating fully operable bots.
- All actions are stored in Couchbase for auditing.
- Communication with external services (API) and bot behaviour are written to logs and consumed by Splunk, a commercial service that gathers data from multiple sources and exposes it as dashboards or user-friendly reports. Based on that we receive alerts on errors and problems with external services, which can then be solved quickly.
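To make the flow above a bit more concrete, here is a minimal sketch of how a façade might wrap an incoming message in an Activity-like object, pass it to the bot module and return the replies. It is written in Python for brevity and is not the project's C# code: the `Activity`, `BotCore` and `Facade` classes, the `CreateProject` intent name and the keyword-based recognizer are all hypothetical stand-ins for the real Bot Framework types, LUIS.ai and the audit store.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Activity:
    """Hypothetical, simplified stand-in for a Bot Framework Activity."""
    conversation_id: str
    user: str
    type: str  # e.g. "message" or "file"
    text: str = ""


class BotCore:
    """Stand-in for the module that handles intents and conversation flow."""

    def __init__(self, recognize: Callable[[str], str]):
        self._recognize = recognize  # stand-in for the call to the NLP service

    def handle(self, activity: Activity) -> List[str]:
        intent = self._recognize(activity.text)
        if intent == "CreateProject":
            return ["Sure, what should the project be called?"]
        return ["Sorry, I did not understand that."]


class Facade:
    """Stand-in for the middle layer between the chat client and the bot."""

    def __init__(self, bot: BotCore):
        self._bot = bot
        self.audit_log: List[Activity] = []  # stand-in for the audit store

    def on_message(self, conversation_id: str, user: str, text: str) -> List[str]:
        # Describe the user's action as an Activity, record it, then hand it over.
        activity = Activity(conversation_id, user, "message", text)
        self.audit_log.append(activity)
        return self._bot.handle(activity)


def keyword_recognizer(text: str) -> str:
    # Toy replacement for natural language processing: keyword matching only.
    return "CreateProject" if "project" in text.lower() else "None"


facade = Facade(BotCore(keyword_recognizer))
replies = facade.on_message("conv-1", "alice", "I need a new project")
print(replies)
```

Keeping the recognizer as an injected function means the same flow can be exercised in tests without calling any external service.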
Securing the bot
With a bot having so much power we had to secure it! This is how we did it:
- The solution works on premise. By default, this protects us from communication from outside the company’s network and greatly reduces the risk of the bot being hacked.
- Throttling user activity. We count every action that the bot takes on behalf of a user. This lets us cap the number of actions per user and block any new ones once the cap is reached. The threshold is validated before an action is taken.
- Message validation before sending to LUIS. We check the message length and, if it is longer than expected, we stop processing it. Messages containing a file are not exposed to LUIS; we process them internally.
- File size validation before files are temporarily stored on the bot server. After the bot has processed them, files are removed from storage.
- Two-step file content verification. Disallowed files are stored in a Quarantine folder for verification.
- Managing the flow of conversation. Step by step, we ask users the right questions to get the right answers. We allow free text in description fields, but in most cases we provide a list of choices to guide the user; otherwise we verify the entered text using regular expressions.
- LUIS does not store the processed messages. We cannot train the NLP model using data from LUIS logs, because information provided by users may contain sensitive data. Therefore, training is done manually using data from the conversation audit reports. We have also provided message tracing as a source for verifying text recognition. Splunk reports when recognition falls below the expected level (configured in the training process).
- A conversation audit report is attached to every new ticket created by the bot on the support platform, after every action taken by the bot, even if it failed to complete.
- A conversation audit report may be retrieved from stored logs by auditors with the appropriate permissions. This allows conversation tracking, which is crucial for analysing any abusive behaviour and for resolving bot issues.
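Three of the checks above (throttling, message validation and regular-expression verification) are simple enough to sketch in a few lines. The following Python snippet is only an illustration under stated assumptions, not the production C# code: the limits, the regular expression and every function and class name are made up for the example.

```python
import re
from collections import defaultdict

MAX_ACTIONS_PER_USER = 3   # assumed threshold, not taken from the project
MAX_MESSAGE_LENGTH = 500   # assumed limit, not taken from the project
# Assumed rule for a project name: starts with a letter, 3 to 64 characters.
PROJECT_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9 _-]{2,63}")


class ActionThrottle:
    """Counts actions per user and refuses new ones once the limit is hit."""

    def __init__(self, limit: int = MAX_ACTIONS_PER_USER):
        self._limit = limit
        self._counts = defaultdict(int)

    def try_acquire(self, user: str) -> bool:
        # The threshold is validated before the action is taken.
        if self._counts[user] >= self._limit:
            return False
        self._counts[user] += 1
        return True


def is_safe_for_nlp(text: str, has_attachment: bool = False) -> bool:
    """Only short, attachment-free messages are forwarded to the NLP service."""
    return not has_attachment and len(text) <= MAX_MESSAGE_LENGTH


def is_valid_project_name(text: str) -> bool:
    """Checks free-text input against the regular expression."""
    return PROJECT_NAME_PATTERN.fullmatch(text) is not None


throttle = ActionThrottle(limit=2)
print([throttle.try_acquire("alice") for _ in range(3)])  # third call is refused
print(is_safe_for_nlp("x" * 501))
print(is_valid_project_name("Payments API"))
```

A real deployment would also need to decide when the counters reset (per day, per conversation, and so on); the sketch leaves that out.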
Testing and quality assurance
Our testing strategy was to focus on the code itself and how it behaves, to eliminate potential issues before deploying a new version to production environments. In addition, we wanted to test conversation flows similar to those of real users. To achieve all of this, we prepared various test types:
- Unit tests, verifying major part of the core functions. We reached above 70% of real
- LUIS integration tests, verifying the intent recognition score against provided sample data. They protect our trained LUIS instance from breaking when we introduce new intents or entities.
- Automated regression and functional tests, written in Gherkin using SpecFlow and C#. Those tests verified all the features served by our bot. The most important thing was testing the conversation flow: we impersonated test users (with different permissions and roles in the system) and communicated with a working bot instance. We followed the conversation stages to check all paths of a given feature.
This approach shielded us from introducing major bugs to production and allowed us to keep delivering a stable solution.
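The idea behind the LUIS integration tests can be sketched as follows: run each sample utterance through the recognizer and flag those whose top intent is wrong or whose score is too low. This is a hedged Python illustration, not the project's test code. The threshold, the sample utterances, the intent names and `fake_recognizer` (a stub that replaces the real call to the NLP service) are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

MIN_SCORE = 0.8  # assumed acceptance threshold

# (utterance, expected intent): hypothetical sample data
SAMPLES: List[Tuple[str, str]] = [
    ("create a new project", "CreateProject"),
    ("add me to the team", "AddTeamMember"),
]

Recognizer = Callable[[str], Dict[str, float]]


def fake_recognizer(utterance: str) -> Dict[str, float]:
    """Stub standing in for the NLP service; returns a score per intent."""
    if "project" in utterance:
        return {"CreateProject": 0.93, "AddTeamMember": 0.04}
    if "team" in utterance:
        return {"CreateProject": 0.10, "AddTeamMember": 0.88}
    return {"None": 1.0}


def failing_samples(recognize: Recognizer) -> List[str]:
    """Returns utterances whose top intent is wrong or scores below MIN_SCORE."""
    failures = []
    for utterance, expected in SAMPLES:
        scores = recognize(utterance)
        top_intent = max(scores, key=scores.get)
        if top_intent != expected or scores[top_intent] < MIN_SCORE:
            failures.append(utterance)
    return failures


print(failing_samples(fake_recognizer))
```

Running such a check in the build pipeline after every retraining is one way to notice that a newly added intent has started to steal utterances from an existing one.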
The last, and most important, step of this project was to go live and show everybody that there is a new and easy way to deal with support tasks.
- Saved developers and DevOps support members tens of days that they would otherwise have spent waiting or taking actions manually.
- Bot usage has doubled every month since the application went live (counted as actions completed by users).
- Released 13 versions within 8 months.
- Covered our code with:
  - ~600 unit tests
  - 100+ LUIS integration tests
  - 120 automated functional and regression tests verifying all introduced features
- No major issues during the releases (anything that would stop bot from working or break existing features).
- Splunk dashboard for log handling, issues alerting and metrics calculation (e.g. usage of features, number of conversations per period of time).
- LYNC 2013 Server working on premise linked with Bot Framework.
- LYNC 2013 Client used for impersonation in automated tests.
- File transfer through LYNC 2013 using Microsoft encrypted MSNFTP protocol.
- Custom decision tree and conversation workflow allowing the bot to use LUIS.ai at any conversation stage.
- Gherkin logger used to prepare automated test scenarios. A conversation with the bot is stored as a scenario, which may be replayed to verify a feature.
- All tests automated – integration with LUIS.ai built in pipeline. Regression and functional tests automated using tester LYNC account impersonation and NUnit Lite runner.
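As an illustration of the Gherkin logger idea from the list above, the sketch below turns a recorded conversation into a Gherkin scenario. It is a simplified Python example under assumptions of our own: the step wording ("I send", "the bot replies"), the transcript format and the `to_gherkin` function are hypothetical and are not the step definitions used in the project.

```python
from typing import List, Tuple

# (speaker, text) pairs, where speaker is "user" or "bot"
Transcript = List[Tuple[str, str]]


def to_gherkin(scenario_name: str, transcript: Transcript) -> str:
    """Renders a recorded conversation as a Gherkin scenario."""
    lines = [f"Scenario: {scenario_name}"]
    previous_speaker = None
    for speaker, text in transcript:
        if speaker == "user":
            keyword, phrase = "When", f'I send "{text}"'
        else:
            keyword, phrase = "Then", f'the bot replies "{text}"'
        # Consecutive steps by the same speaker are chained with "And".
        if speaker == previous_speaker:
            keyword = "And"
        lines.append(f"  {keyword} {phrase}")
        previous_speaker = speaker
    return "\n".join(lines)


# A made-up conversation, used only to demonstrate the output format.
recorded = [
    ("user", "create a new project"),
    ("bot", "What should the project be called?"),
    ("user", "Payments API"),
    ("bot", "Project Payments API has been created."),
    ("bot", "Anything else?"),
]
print(to_gherkin("Create a project", recorded))
```

The sketch does not escape quotation marks inside messages; a real logger would have to handle that before the scenario can be replayed reliably.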
Going to Azure
To allow the bot to work in Azure, we would need to remove the LYNC façade from its ecosystem. This would also make the bot more scalable, because by default Bot Framework covers conversation state management, so any instance of the bot can handle a running conversation with all its context.
As you can see, chatbots can easily be included in any business or technology process that requires a lot of setup and interaction with external services. They also work in situations where data has to be collected from the user, passed to another service and finally approved by another user.
A chatbot can also act as a Q&A machine, working on top of Confluence, for instance, and providing a quick response to any question related to a knowledge base.
Because the bot acts as a regular Skype/LYNC member, it can easily start a conversation with another bot or with a user responsible for approvals. The result is an almost fully automated process that would normally require a lot of human interaction with multiple systems, which is tedious and error prone!
Łukasz Tomczak & Maciej Wysocki – major solution architects and developers responsible for design, implementation, testing and delivery
Igor Fiodorow – responsible for base PoC solution
LYNC 2013 Server SDK, LYNC 2013 Client SDK, LUIS.ai, Microsoft BotFramework 3.0, TopShelf, Couchbase, Active Directory, ServiceNow API, Azure DevOps, Azure DevOps API, Octopus Deploy, Octopus Deploy API, C# and .NET Framework 4.7, Quartz .NET, NLog, Autofac, SpecFlow, Git, Microsoft MSNSFTP