Engineering Team Content Hub - Tips From Engineers

Engineering Blog

Why Confidence Matters: Experimental Design

by Cassie Quek Wednesday, January 19th, 2022

This post is part three of Why Confidence Matters, a series about how we improved Defender’s confidence score to unlock a number of important features. You can read part one here and part two here. Bringing our series to a close, we explore the technical design of our research pipeline that enabled our Data Scientists to iterate over models with speed. We aim to provide an insight into how we solved issues pertaining to our particular dataset, and conclude with how this project had an impact for our customers and product. Why design a pipeline? Many people think that a Data Scientist’s job is like a Kaggle competition – you throw some data at a model, get the highest scores, and boom, you’re done! In reality, building a product such as Tessian Defender was never going to be a one-off job. The challenges of making a useful machine learning (ML) model in production lies not only in its predictive powers, but also in its speed of iteration, reproducibility, and ease of future improvements. At Tessian, our Data Scientists oversee the project end-to-end, from conception and design, all the way through to deployment in production, monitoring, and maintenance. Hence, our team started by outlining the above longer-term requirements, then sat down together with our Engineers to design a research flow that would fulfill these objectives. Here’s how we achieved the requirements we set out.

The research pipeline

The diagram above shows the design of the pipeline with its individual steps, starting from the top left. An overall configuration file specifies many parameters for the pipeline, such as the date range for the email data we’ll be using and the features we’ll compute. The research pipeline is then run on Amazon Sagemaker, and takes care of everything from ingesting the checked email data from S3 (Collect Logs step) to training and evaluating the model (at the bottom of the diagram). Because the pipeline is split into independent and configurable “steps”, each storing its output before the next picks it up, we were able to iterate quickly. This provided flexibility to configure and re-run from any step without having to re-run all the previous steps, which allowed for experimentation at speed. In our experience, we had only to revise the slowest data collection and processing steps a couple of times to get it right (steps 1-3), and most work and improvements involved experimenting with the features and model training steps (steps 4-5). The later research steps take only a few minutes to run as opposed to hours for the earlier steps, and allow us to test features and obtain answers about them quickly.

Five Key Steps within the Pipeline Some of these will be familiar to any Data Science practitioner. We’ll leave out general descriptions of these well-known ML steps, and instead focus on the specific adjustments we made to ensure the confidence model worked well for the product. 1. Collect Logs This step collects all email logs with user responses from S3 and transforms them to a format suitable for later use, stored separately per customer. These logs contain information on decisions made by Tessian Defender, using data available at the time of the check. We also lookup and store additional information to enrich and add context to the dataset at this stage. 2. Split Data The way we choose to create the training and test datasets is very important to the model outcome. As mentioned before, consistency in model performance across different cuts of the data is a major concern and success criterion. In designing our cross-validation strategy, we utilized both time-period hold-outs and a tenants hold-out. The time-period hold-out allows us to confirm that the model generalizes well across time even as the threat landscape changes, while testing on a tenant hold-out ensures the model generalises well across all our customers, that are spread across industries and geographical regions. Having this consistency means that we can confidently onboard new tenants and maintain a similar predictive power of Tessian Defender on their email traffic. However, the downside to having multiple hold-outs is that we’re effectively throwing out data that did not fit within both constraints for each dataset, as demonstrated in the chart below.

We eventually compromised by allowing a slight overlap between train and validation tenants (but not on test tenants), minimizing the data discarded where possible. 3. Labels Aggregation In part two, we also highlighted that one of the challenges of the user-response dataset is mislabelled data. Greymail and spam are often wrongly labeled as phishing, and can cause the undesired effect of the model prioritizing spam, making the confidence score less meaningful for admins. Users also often disagree on whether the same email is safe or malicious. This step takes care of these concerns by cleaning out spam and aggregating the labels. In order to assess the quality of user-feedback, we first estimated the degree of agreement between user-labels and security expert labels using a sample of emails, and found that user-labels and expert-labels matched in around 85% of cases. We addressed the most systematic bias observed in this exercise by developing a few simple heuristics to correct cases where users reported spam emails as malicious. Where we have different labels for copies of the same email sent to multiple users, we applied an aggregation formula to derive a final label for the group. This formula is configurable, and carefully assessed to provide the most accurate labels. 4. Features This step is where most of the research took place – trialing new feature ideas and iterating on them based on feature analysis and metrics from the final step. The feature computation actually consisted of two independently configurable steps: one for batch features and another for individually computed features. The features consisted of some natural language processing (NLP) vectorizations which were computed faster as a batch, and were more or less static after initial configurations. Splitting it out simplified the structure and maximized our flexibility. Other features based on stateful values (dependent on the time of the check) such as domain reputations and information from external datasets were computed or extracted individually, such as whether any of the URL domains in the email was registered recently. 5. Model Training and Evaluation In the final and arguably most exciting step of the pipeline, the model is created and evaluated. Here, we configure the model type and its various hyperparameters before training the model. Then, based on the validation data, the “bucket” thresholds are defined. As mentioned in part two, we defined five confidence buckets that simplified communication and understanding with users and stakeholders. These buckets range in priority from Very Low to Very High. In addition, this step produces the key metrics we’ll use to compare the models. These metrics include both generic ML metrics and Tessian Defender product-specific metrics as mentioned in part two, against each of the data splits. Using MLFLow, we can keep track of the results of our experiments neatly, logging the hyperparameters, metrics, and even store certain artifacts that would be relevant in case we needed to reproduce the model. The interface allowed us to easily compare models based on their metrics. Our team held a review meeting weekly to discuss the things we’ve tried and the metrics it has produced before agreeing on next steps and experiments to try. We found this practice very effective as the Data Science team rallied together to meet a deadline each week, and product managers could easily keep track of the project’s progress. During this process, we also kept in close contact with several beta users to gather quick feedback on the work-in-progress models, ensuring that the product was being developed with their needs in mind.

The improved confidence score The new priority model was only deployed when we hit the success criteria we set out to meet. As set out in part two, besides the many metrics such as AUC-ROC we tracked internally in order to give us direction and compare the many models, our main goal was always to optimize the users’ experience. That meant that the success criteria depended on product-centric metrics: the precision and number of quarantined emails for a client, the rate at which we could improve overall warning precision, and consistency of performance across different slices of data (time, tenants, threat types). Based on the unseen test data, we observed a more-than-double improvement in the precision of our highest priority bucket, with our newest priority model. This improved the user experience of Tessian Defender greatly, as it meant that a security admin could now find malicious emails more easily and act on it more quickly, and that quarantining emails without compromising on users’ workflow was a possibility.

Product Impact As a Data Scientist working on a live app like Tessian Defender, rolling out a new model is always the most exciting part of the process. We get to observe the product impact of the model instantly, and get feedback through the monitoring devices we have in place, or by speaking directly with Defender customers. As a result of the improved precision in the highest priority bucket, we unlocked the ability to quarantine with confidence. We are assured that the model is able to quarantine a significant number of threats (for all clients), massively reducing risk exposure for the company, and saving employees precious time and the burden and responsibility of discerning malicious mails, at a low rate of false positives. We also understand that not all false positives are equal – for example, accidentally quarantining a safe newsletter has almost zero impact compared to quarantining an urgent legal document that requires immediate attention. Therefore, prior to roll-out, our team also made inquiries to quantify this inconvenience factor, ensuring that the risk of quarantining a highly important, time-sensitive email was highly unlikely. All of this meant that the benefit of turning on auto-quarantine and protecting the user from a threat far outweighs the risk of interrupting the user’s work-flow and any vital business operations.

With this new model, Tessian Defender-triggered events are also being sorted more effectively. Admins who log in to the Tessian portal will find the most likely malicious threats at the top, allowing them to act upon the threats instantly. Admins can quickly review the suspicious elements highlighted by Tessian Defender and gain valuable insights about the email such as: its origin how often the sender has communicated with the organization’s users how users have responded to the warning They can then take action such as removing the email from all users’ inboxes, or adding the sender to a denylist. Thus, even in a small team, security administrators are able to effectively respond to external threats, even in the face of a large number of malicious mails, all the while continuing to educate users in the moment on any phishy-looking emails.

Lastly, with the more robust confidence model, we are able to improve the accuracy of our warnings. By ensuring a high warning precision overall, users pay attention to every individual suspicious event, reap the full benefits of the in-situ training, and are more likely to pause and evaluate the trustworthiness of the email. As the improved confidence model is able to provide a more reliable estimate on the likelihood of an email being malicious, we are able to cut back on warning on less phishy emails that a user would learn little out of. This concludes our 3-part series on Why Confidence Matters. Thank you for reading! We hope that this series has given you some insight into how we work here at Tessian, and the types of problems we try to solve. To us, software and feature development is more than just endless coding and optimizing metrics in vain – we want to develop products that will actually solve peoples’ problems. If this work sounds interesting to you, we’d love for like-minded Data Scientists and Developers to join us on our mission to secure the Human Layer! Check out our open roles and apply today. (Co-authored by Gabriel Goulet-Langlois and Cassie Quek)

Read Blog Post

Engineering Blog

Tessian’s Client Side Integrations QA Journey – Part I

by Craig Callender Thursday, May 20th, 2021

In this series, we’re going to go over the Quality Assurance journey we’ve been on here in the Client Side Integrations (CSI) team at Tessian. Most of this post will be using our experience with the Outlook Add-in, as that’s the piece of software most used by our clients. But the philosophies and learnings here apply to most software in general (regardless of where it’s run) with an emphasis on software that includes a UI. I’ll admit that the onus for this work was me sitting in my home office Saturday morning knowing that I’d have to start manual testing for an upcoming release in the next two weeks and just not being able to convince myself to click the “Send” button in Outlook after typing a subject of “Hello world” one more time… But once you start automating UI tests, it just builds on itself and you start being able to construct new tests from existing code. It can be such a euphoric experience. If you’ve ever dreaded (or are currently dreading) running through manual QA tests, keep reading and see if you can implement some of the solutions we have. Why QA in the Outlook Add-in Ecosystem is Hard The Outlook Add-in was the first piece of software written to run on our clients’ computers and, as a result of this, alongside needing to work in some of the oldest software Microsoft develops (Outlook), there are challenges when it comes to QA. These challenges include: Detecting faults in the add-in itself Detecting changes in Outlook which may result in functionality loss of our add-in Detecting changes in Windows that may result in performance issues of our add-in Testing the myriad of environments our add-in will be installed in The last point is the hardest to QA, as even a list of just a subset of the different configurations of Outlook shows the permutations of test environments just doesn’t scale well: Outlook’s Online vs Cached mode Outlook edition: 2010, 2013, 2016, 2019 perpetual license, 2019 volume license, M365 with its 5 update channels… Connected to On-Premise Exchange Vs Exchange Online/M365 Other add-ins in Outlook Third-party Exchange add-ins (Retention software, auditing, archiving, etc…) And now add non-Outlook related environment issues we’ve had to work through: Network proxies, VPNs, Endpoint protection Virus scanning software Windows versions One can see how it would be impossible to predict all the environmental configurations and validate our add-in functionality before releasing it. A Brief History of QA in Client Side Integrations (CSI) As many companies do – we started our QA journey with a QA team. This was a set of individuals whose full time job was to install the latest commit of our add-in and test its functionality. This quickly grew where this team was managing/sharing VMs to ensure we worked on all those permutations above. They also worked hard to try and emulate the external factors of our clients’ environments like proxy servers, weak Internet connections, etc… This model works well for exploratory testing and finding strange edge cases, but where it doesn’t work well or scale well, is around communication (the person seeing the bug isn’t the person fixing the bug) and automation (every release takes more and more person-power as the list of regression issues gets longer and longer). In 2020 Andy Smith, our Head of Engineering, made a commitment that all QA in Tessian would be performed by Developers. This had a large impact on the CSI team as we test an external application (Outlook) across many different versions and configurations which can affect its UI. So CSI set out a three phase approach for the Development team to absorb the QA processes. (Watch how good we are at naming things in my team.) Short-Term The basic goal here was that the Developers would run through the same steps and processes that were already defined for our QA. This meant a lot of manual testing, manually configuring environments, etc. The biggest learning from our team during this phase was that there needs to be a Developer on an overview board whenever you have a QA department to ensure that test steps actually test the thing you want. We found many instances where an assumption in a test step was made that was incorrect or didn’t fully test something. Medium-Term The idea here was that once the Developers are familiar and comfortable running through the processes defined by the QA department, we would then take over ownership of the actual tests themselves and make edits. Often these edits resulted in the ability to test a functionality with more accuracy or less steps. It also included the ability to stand up an environment that tests more permutations, etc. The ownership of the actual tests also meant that as we changed the steps, we needed to do it with automation in mind. Long-Term Automation. Whether it’s unit, integration, or UI tests, we need them automated. Let a computer run the same test over and over again let the Developers think of ever increasing complexity of what and how we test. Our QA Philosophy Because it would be impossible for us to test every permutation of potential clients’ environments before we release our software (or even an existing client’s environment), we approach our QA with the following philosophies: Software Engineers are in the Best Position to Perform QA This means that the people responsible for developing the feature or bug, are the best people when it comes to writing the test cases needed to validate the change, add those test cases to a release cycle, and to even run the test itself. The why’s of this could be (and probably will be) a whole post. 🙂 Bugs Will Happen We’re human. We’ll miss something we shouldn’t have. We won’t think of something we should have. On top of that, we’ll see something we didn’t even think was a possibility. So be kind and focus on the solution rather than the bad code commit. More Confidence, Quicker Our QA processes are here to give us more confidence in our software as quickly as possible, so we can release features or fixes to our clients. Whether we’re adding, editing, or removing a step in our QA process, we ask ourselves if doing this will bring more confidence to our release cycle or will it speed it up. Sometimes we have to make trade-offs between the two. Never Release the Same Bug Twice Our QA process should be about preventing regressions on past issues just as much as it is about confirming functionality of new features. We want a robust enough process that when an issue is encountered and solved once, that same issue is never found again. In the least, this would mean we’d never have the same bug with the same root cause again. At most it would mean that we never see the same type of bug again, as a root cause could be different even though the loss in functionality is the same. An example of this last point is that if our team works through an issue where a virus scanner is preventing us from saving an attachment to disk, we should have a robust enough test that will also detect this same loss in functionality (the inability to save an attachment to disk) for any cause (for example, a change to how Outlook allows access to the attachment, or another add-in stubbing the attachment to zero-bytes for archiving purposes, etc…) How Did We Do? We started this journey with a handful of unit tests that were all automated in our CI environment. Short-Term Phase During the Short-Term phase, there was an emphasis on new commits ensuring that we had unit tests alongside them. Did we sometimes make a decision to release a feature with only manual tests because the code base didn’t lend itself to unit testability? YES! But we worked hard to always ensure we had a good reason for excluding unit tests instead of just assuming it couldn’t be done because it hadn’t before. Being flexible, while at the same time keeping your long-term goal in mind is key, and at times, challenging. Medium-Term This phase wasn’t made up of nearly as much test re-writing as we had intentionally set out for. We added a section to our pull requests to include links to any manual testing steps required to test the new code. This resulted in more new, manual tests being written by developers than edits to existing ones. We did notice that the quality of tests changed. It’s tempting to say, “for the better”, “or with better efficiency”, but I believe most of the change can more be attributed to an understanding that the tests were now being written for a different audience, namely Developers. They became a bit more abstract and a bit more technical. Less intuitive. They also became a bit more verbose as we get a bad taste in our mouth whenever we see a manual step that says something like, “Trigger an EnforcerFilter” with no description on which one? One that displays something to the user or just the admin? Etc…. This phase was also much shorter than we had originally thought it would be. Long-Term This was my favorite phase. I’m actually one of those software engineers that LOVE writing unit tests. I will attribute this to JetBrains’ ReSharper (I could write about my love of ReSharper all day) interface which gives me oh-so-satisfying green checkmarks as my tests run… I love seeing more and more green checkmarks! We had arranged a long term OKR with Andy, which gave us three quarters in 2021 to implement automation of three of our major modules (Constructor, Enforcer, and Guardian)— with a stretch goal of getting one test working for our fourth major module, Defender. We blew this out of the water and met them all (including a beta module – Architect) in one quarter. It was addicting writing UI tests and watching the keyboard and mouse move on its own. Wrapping it All Up Like many software product companies large and small, Tessian started out with a manual QA department composed of technologists but not Software Engineers. Along the way, we made the decision that Software Engineers need to own the QA of the software they work on. This led us on a journey, which included closer reviews of existing tests, writing new tests, and finally automating a majority of our tests. All of this combined allows us to release our software with more confidence, more quickly. Stay tuned for articles where we go into details about the actual automation of UI tests and get our hands dirty with some fun code.

Read Blog Post

Intelligent Cloud Email Security

Tessian Solutions

Our Customers

Tessian Resources Hub

Tessian Blogs

Engineering Blog

Request a Demo of Tessian Today.

Intelligent Cloud Email Security

Tessian Solutions

Our Customers

Tessian Resources Hub

Tessian Blogs

Engineering Blog