State transitions in Amazon Web Services’ (AWS) Standard Step Functions workflows were limiting Amenity’s model development cycle. Migrating to AWS Step Functions Express workflows enabled Amenity to run its NLP pipelines with significantly lower infrastructure cost and much higher throughput.

By
Uri Tsemach
|
November 15, 2021

Improving NLP Model Efficiency on Serverless Infrastructure

This project is a great example of how to achieve high scalability by migrating from standard Step Functions to Step Functions Express within Amazon Web Services (AWS). Here are the results:

  • A 15× increase in processing speed
  • A complete pipeline that used to take around 45 minutes now takes about 3 minutes
  • Backtesting (BDD) went from more than 1 hour to about 6 minutes
  • Unit testing (TDD) went from around 25 minutes to about 30 seconds
  • Continuous integration went from around 25 minutes after the first migration to about 6 minutes after the second

Not only did this migration significantly increase our throughput, it also removed the need for users to coordinate workflow processes to create a build.

The Amenity Process

Amenity develops enterprise NLP platforms for the finance, insurance, and media industries that extract critical insights from mountains of documents. With our software solutions, we provide the fastest, most accurate, and most scalable way for businesses to get a human-level understanding of information from text.

Amenity’s models are developed with a Test-Driven Development (TDD) and Behavior-Driven Development (BDD) approach in order to verify model accuracy, precision, and recall throughout the model lifecycle—from creation to production and maintenance.

AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Workflows manage failures, retries, parallelization, service integrations, and observability so developers can focus on higher-value business logic.

One of the actions in the Amenity model development cycle is backtesting, which is mainly part of our continuous integration (CI) process. The CI process runs Amenity’s tests and verifies that each model performs as expected; part of this is running the reviews for all models.

The backtesting process runs hundreds of thousands of annotated examples in each code build. To handle a process of this size, we originally used the Step Functions Standard workflow.

Limitations of the Step Function Standard Workflow

We found that the Step Functions Standard workflow enforces a state-transition quota as a token bucket: a bucket size of 5,000 transitions with a refill rate of 1,500 transitions per second. Each annotated example takes around 10 state transitions, which adds up to millions of state transitions per code build. Since this quota could not be raised to a level that satisfied our needs, we often faced delays and timeouts. Developers had to coordinate their work with each other, slowing down the entire development cycle.
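To see why this quota was a bottleneck, consider a back-of-the-envelope model of the token bucket. The numbers below come from the text (bucket of 5,000, refill of 1,500 per second, ~10 transitions per example); the function itself is purely illustrative, not AWS code:

```python
def time_to_drain(transitions, bucket=5_000, refill_per_sec=1_500):
    """Minimum seconds to admit `transitions` state transitions,
    assuming tokens are consumed as fast as the bucket refills."""
    if transitions <= bucket:
        return 0.0  # the initial bucket absorbs the whole burst
    return (transitions - bucket) / refill_per_sec

examples = 200_000            # "hundreds of thousands" of annotated examples
transitions = examples * 10   # ~10 state transitions per example
secs = time_to_drain(transitions)
print(f"{transitions:,} transitions need at least {secs / 60:.0f} minutes")
```

With 200,000 examples the quota alone imposes a floor of roughly 22 minutes per build, before any actual compute time, and concurrent builds all draw from the same bucket.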

In addition, we needed to change the way each step in the pipeline is triggered — from an async call to a sync API call.

Step Functions Express Workflow Solution

When a model developer merges their new changes, the CI process starts backtesting for all existing models.

For each model, the backtesting process checks whether the model’s review items were already uploaded and saved in the Amazon Simple Storage Service (S3) cache. The check is made with a unique key representing the list of items. Once a model is reviewed, its review items rarely change, so we want to avoid uploading them every time.
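A common way to derive such a key is to hash a canonical serialization of the item list, so identical items always map to the same S3 key. This is a hypothetical sketch (the field names and key prefix are illustrative, not Amenity’s actual scheme):

```python
import hashlib
import json

def review_items_key(items):
    """Deterministic cache key for a list of review items: hash the
    canonical JSON so the same items always yield the same S3 key.
    (Illustrative sketch; field names and prefix are assumptions.)"""
    canonical = json.dumps(items, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"review-items/{digest}"

key = review_items_key([{"id": 1, "text": "revenue grew"}])
print(key)
```

Because the serialization is canonical (sorted keys, fixed separators), re-submitting the same items, even with fields in a different order, produces the same key, and the upload can be skipped.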

If the review items haven’t been uploaded yet, the backtesting process uploads them and triggers an unarchive process so they can be used in the execution phase.

Once the items are uploaded, an execute request containing the review-items key is sent through Amazon API Gateway.

The request is forwarded to an AWS Lambda function, which validates the request and inserts a job message into an Amazon Simple Queue Service (SQS) queue.

The SQS messages are consumed by a limited number of concurrent Lambda functions, each of which synchronously invokes a Step Function. The concurrency is capped to ensure we never reach the Lambda concurrency limit of the production environment.
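A minimal sketch of such a consumer Lambda is below. Express workflows support synchronous invocation via the Step Functions `StartSyncExecution` API (boto3’s `start_sync_execution`); here the client is injected so the sketch runs without AWS credentials, and the message shape is an assumption:

```python
import json

def handle_sqs_batch(event, sfn_client, state_machine_arn):
    """Consume a batch of SQS job messages and run each through the
    Express workflow synchronously. `sfn_client` is any object with a
    start_sync_execution method (boto3's Step Functions client in
    production, a stub here). Sketch only; message shape is assumed."""
    statuses = []
    for record in event["Records"]:
        job = json.loads(record["body"])
        resp = sfn_client.start_sync_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps(job),
        )
        statuses.append(resp["status"])
    return statuses

# Stub client so the sketch is runnable outside AWS.
class StubSfn:
    def start_sync_execution(self, stateMachineArn, input):
        return {"status": "SUCCEEDED", "output": input}

event = {"Records": [{"body": json.dumps({"reviewItemsKey": "review-items/abc"})}]}
print(handle_sqs_batch(event, StubSfn(), "arn:aws:states:example"))
```

In production the reserved concurrency of this consumer function is what caps how many workflows run at once.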

When the Step Function finishes processing an item, it creates an SQS message containing a notification. This message is inserted into a queue and consumed, in batches, by a Lambda function. The function aggregates the messages by end user and sends each user an AWS IoT message containing all of their relevant notifications.
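The per-user aggregation step can be sketched as a simple group-by over the batch. The `user`/`payload` field names are illustrative, not the actual message schema:

```python
from collections import defaultdict

def aggregate_by_user(messages):
    """Group notification messages by end user so each user receives a
    single IoT message with all of their updates, instead of one
    message per item. (Sketch; field names are assumptions.)"""
    grouped = defaultdict(list)
    for msg in messages:
        grouped[msg["user"]].append(msg["payload"])
    return dict(grouped)

batch = [
    {"user": "alice", "payload": "doc-1 done"},
    {"user": "bob",   "payload": "doc-2 done"},
    {"user": "alice", "payload": "doc-3 done"},
]
print(aggregate_by_user(batch))
# One IoT publish per user would follow, rather than one per message.
```

Batching like this cuts the number of IoT publishes from one per processed item to one per user per Lambda invocation.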

Figure 1: Step Functions Express Workflow Solution

Main Step Function Express Workflow Pipeline

To change from async to sync processing, we had to replace SNS + SQS with an API Gateway.

The following diagram shows how a single document is processed in Step Functions Express:

  • Generate a document ID for the current execution
  • Perform base NLP analysis by calling another Step Functions Express workflow wrapped by an API Gateway
  • Reformat the response to match the results of all other “logic” steps
  • Verify the result with a “Choice” state: if it failed, go to end; otherwise continue
  • Perform the Amenity core NLP analysis in 3 model executions: Group, Patterns, BL (Business Logic)
  • After each model execution step, check whether the result is OK: if it failed, go to end; otherwise continue
  • Return a formatted result at the end
Figure 2: Workflow for a Single Document
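The short-circuit behavior of those “Choice” states can be mimicked in plain Python: run each step in order and stop at the first failure. The step names and the `status` convention are illustrative stand-ins, not the real state machine:

```python
def run_pipeline(document, steps):
    """Mimic the Express pipeline's Choice states: run steps in order
    and stop at the first failure, as the workflow does. Sketch only;
    step names and result shape are assumptions."""
    result = {"documentId": "doc-001", "input": document}
    for name, step in steps:
        result = step(result)
        if result.get("status") != "OK":   # the "Choice" check
            result["failedStep"] = name
            return result
    return result

# Illustrative stand-ins for the real steps (base NLP, Group, Patterns, BL).
ok   = lambda r: {**r, "status": "OK"}
fail = lambda r: {**r, "status": "FAILED"}

print(run_pipeline("some text", [("base_nlp", ok), ("group", ok),
                                 ("patterns", fail), ("bl", ok)]))
```

A failure in any model execution ends the run immediately with a formatted result, so downstream steps never execute on bad input.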

Base NLP Analysis Step Function Express

For our base NLP analysis, we use spaCy. The following diagram shows how we use it in Step Functions Express:

  • We start by checking whether the given text has already been analyzed and exists in the cache
  • If it exists, return the cached result
  • If it doesn’t exist, split the text into limited-size parts (we can receive a request to analyze a very large text, and spaCy limits the size of text it can analyze in one pass)
  • All the text parts are analyzed in parallel by spaCy
  • Once all the parts are analyzed, we merge the results into a single analyzed document and save it to the cache
  • If there was an exception during the process, we handle it in the “HandleStepFunctionExceptionState”
  • We return a response to the user with a reference to the analyzed document, or with an error message if there was an exception
Figure 3: Base NLP Analysis for a Single Document
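The splitting step above can be sketched as a chunker that keeps each part under a size limit while preferring to break at whitespace. This is a minimal sketch; spaCy’s actual length limit and Amenity’s real splitter (which may respect sentence boundaries) are not specified in the text:

```python
def split_text(text, max_len):
    """Split text into chunks no longer than max_len, preferring to
    break at whitespace so words aren't cut in half. (Sketch only;
    the real splitter may use sentence boundaries instead.)"""
    chunks = []
    while len(text) > max_len:
        cut = text.rfind(" ", 0, max_len)
        if cut <= 0:          # no space in range: hard split
            cut = max_len
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks

parts = split_text("the quick brown fox jumps over the lazy dog", 15)
print(parts)
```

Each chunk can then be analyzed by a separate parallel branch, and the per-chunk results merged back into one document.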

Conclusion

By migrating from Step Functions Standard workflows to Express workflows, Amenity made its process 15 times faster. A complete pipeline that used to take around 45 minutes with Standard workflows now completes in about 3 minutes with Express.

The previous limitations forced users to coordinate with each other and run just a single execution at a time; otherwise their builds would fail. The migration removed this limit, and users now execute processes whenever they like.

As we mentioned before, our CI process contains two parts (which run in parallel):

  • Unit tests (TDD) - took around 25 mins (P95)
  • Backtesting tests (BDD) - took more than 1 hour (P95)

The migration happened in two phases: the backtesting phase and the unit-test phase. The backtesting part was reduced from more than one hour to approximately six minutes (P95). This migration was deployed on August 10.

The unit-test part was reduced from around 25 minutes to around 30 seconds. This migration was deployed on September 14.

We can see the effect in the following diagram:

Figure 4: Process Time Reduced from 50 Minutes to 6 Minutes

After the first migration, the CI was limited by the unit tests, which took about 25 minutes. When the second migration was deployed, the time dropped to about 6 minutes (P95).

Check out Amenity’s ESG Spotlight and NLP analysis on other financial trends.

This communication does not represent investment advice. Transcript text provided by S&P Global Market Intelligence.

Copyright ©2021 Amenity Analytics.