Real-time Clinical Trial Monitoring at Clinical ink – migrating from OpenSearch to Rockset for DynamoDB indexing

Clinical ink is a suite of software used in over a thousand clinical trials to streamline the data collection and management process, with the goal of improving the efficiency and accuracy of trials. Its cloud-based electronic data capture system enables clinical trial data from more than 2 million patients across 110 countries to be collected electronically in real time from a variety of sources, including electronic health records and wearable devices.

With the COVID-19 pandemic forcing many clinical trials to go virtual, Clinical ink has been an increasingly valuable solution for its ability to support remote monitoring and virtual clinical trials. Rather than requiring trial participants to come onsite to report patient outcomes, trials can shift their monitoring to the home. As a result, trials take less time to design, develop and deploy, and patient enrollment and retention increase.

To effectively analyze data from clinical trials in the new remote-first environment, clinical trial sponsors came to Clinical ink with the requirement for a real-time 360-degree view of patients and their outcomes across the entire global study. With a centralized real-time analytics dashboard equipped with filter capabilities, clinical teams can take immediate action on patient questions and reviews to ensure the success of the trial. The 360-degree view was designed to be the data epicenter for clinical teams, providing a bird’s-eye view and robust drill-down capabilities so clinical teams could keep trials on track across all geographies.

When the requirements for the new real-time study participant monitoring came to the engineering team, I knew that the current technical stack could not support millisecond-latency complex analytics on real-time data. Amazon OpenSearch, a fork of Elasticsearch used for our application search, was fast but not purpose-built for complex analytics including joins. Snowflake, the robust cloud data warehouse used by our analyst team for performant business intelligence workloads, saw significant data delays and could not meet the performance requirements of the application. This sent us to the drawing board to come up with a new architecture: one that supports real-time ingest and complex analytics while being resilient.

The Before Architecture

Clinical ink before architecture for user-facing analytics

Amazon DynamoDB for Operational Workloads

In the Clinical ink platform, third-party vendor data, web applications, mobile devices and wearable device data is stored in Amazon DynamoDB. Amazon DynamoDB’s flexible schema makes it easy to store and retrieve data in a variety of formats, which is particularly useful for Clinical ink’s application that requires handling dynamic, semi-structured data. DynamoDB is a serverless database, so the team did not have to worry about the underlying infrastructure or scaling of the database, as these are all managed by AWS.

Amazon OpenSearch for Search Workloads

While DynamoDB is a great choice for fast, scalable and highly available transactional workloads, it is not the best for search and analytics use cases. In the first-generation Clinical ink platform, search and analytics was offloaded from DynamoDB to Amazon OpenSearch. As the volume and variety of data increased, we realized the need for joins to support more advanced analytics and provide real-time study patient monitoring. Joins are not a first-class citizen in OpenSearch, requiring a number of operationally complex and costly workarounds including data denormalization, parent-child relationships, nested objects and application-side joins that are challenging to scale.

We also encountered data and infrastructure operational challenges when scaling OpenSearch. One data challenge we faced centered on dynamic mapping in OpenSearch, the process of automatically detecting and mapping the data types of fields in a document. Dynamic mapping was useful as we had a large number of fields with varying data types and were indexing data from multiple sources with different schemas. However, dynamic mapping sometimes led to unexpected results, such as incorrect data types or mapping conflicts that forced us to reindex the data.
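
To make the mapping-conflict problem concrete, here is a minimal Python sketch of a "first value wins" type mapping. It is a simplified model for illustration only: the `infer_type` rules, the field names and the sample documents are hypothetical, and real OpenSearch type detection and coercion behave differently in the details.

```python
from typing import Any

def infer_type(value: Any) -> str:
    """Rough stand-in for automatic type detection (not OpenSearch's real rules)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    return "object"

def find_mapping_conflicts(docs: list[dict[str, Any]]):
    """Lock each field to the first type seen and report later mismatches."""
    mapping: dict[str, str] = {}
    conflicts: list[tuple[int, str, str, str]] = []
    for position, doc in enumerate(docs):
        for field, value in doc.items():
            detected = infer_type(value)
            locked = mapping.setdefault(field, detected)
            if locked != detected:
                conflicts.append((position, field, locked, detected))
    return mapping, conflicts

# Hypothetical documents from two sources that disagree on one field's type.
docs = [
    {"patient_id": "p-001", "heart_rate": 72},
    {"patient_id": "p-002", "heart_rate": "72 bpm"},
    {"patient_id": "p-003", "heart_rate": 68},
]
mapping, conflicts = find_mapping_conflicts(docs)
for position, field, locked, detected in conflicts:
    print(f"doc {position}: field '{field}' mapped as {locked}, got {detected}")
```

In this toy model the second document is flagged because `heart_rate` was already locked to a numeric type by the first document, which is the kind of situation that can force a reindex.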

On the infrastructure side, even though we used managed Amazon OpenSearch, we were still responsible for cluster operations including managing nodes, shards and indexes. We found that as the size of the documents increased we needed to scale up the cluster, which is a manual, time-consuming process. Furthermore, as OpenSearch has a tightly coupled architecture with compute and storage scaling together, we had to overprovision compute resources to support the growing number of documents. This led to compute wastage, higher costs and reduced efficiency. Even if we could have made complex analytics work on OpenSearch, we would have evaluated additional databases, as the data engineering and operational management burden was significant.

Snowflake for Data Warehousing Workloads

We also investigated the potential of our cloud data warehouse, Snowflake, to be the serving layer for analytics in our application. Snowflake was used to provide weekly consolidated reports to clinical trial sponsors and supported SQL analytics, meeting the complex analytics requirements of the application. That said, offloading DynamoDB data to Snowflake was too delayed; at best, we could achieve a 20-minute data latency, which fell outside the time window required for this use case.


Given the gaps in the current architecture, we came up with the following requirements for the replacement of OpenSearch as the serving layer:

  • Real-time streaming ingest: Data changes from DynamoDB need to be visible and queryable in the downstream database within seconds.
  • Millisecond-latency complex analytics (including joins): The database must be able to consolidate global trial data on patients into a 360-degree view. This includes supporting complex sorting and filtering of the data and aggregations of thousands of different entities.
  • Highly resilient: The database is designed to maintain availability and minimize data loss in the face of various types of failures and disruptions.
  • Scalable: The database is cloud-native and can scale at the click of a button or an API call with no downtime. We had invested in a serverless architecture with Amazon DynamoDB and did not want the engineering team to manage cluster-level operations moving forward.

The After Architecture

Clinical ink after architecture using Rockset for real-time clinical trial monitoring

Rockset initially came on our radar as a replacement for OpenSearch for its support of complex analytics on low-latency data.

Both OpenSearch and Rockset use indexing to enable fast querying over large amounts of data. The difference is that Rockset employs a Converged Index, which is a combination of a search index, columnar store and row store for optimal query performance. The Converged Index supports a SQL-based query language, which enables us to meet the requirement for complex analytics.

In addition to Converged Indexing, there were other features that piqued our interest and made it easy to start performance testing Rockset on our own data and queries.

  • Built-in connector to DynamoDB: New data from our DynamoDB tables is reflected and made queryable in Rockset with only a few seconds of delay. This made it easy for Rockset to fit into our existing data stack.
  • Ability to take multiple data types into the same field: This addressed the data engineering challenges that we faced with dynamic mapping in OpenSearch, ensuring that there were no breakdowns in our ETL process and that queries continued to deliver responses even when there were schema changes.
  • Cloud-native architecture: We have also invested in a serverless data stack for resource efficiency and reduced operational overhead. We were able to scale ingest compute, query compute and storage independently with Rockset so that we no longer need to overprovision resources.

Performance Results

Once we determined that Rockset fulfilled the needs of our application, we proceeded to assess the database’s ingestion and query performance. We ran the following tests on Rockset by building a Lambda function with Node.js:

Ingest Performance

The common pattern we see is a lot of small writes, ranging in size from 400 bytes to 2 kilobytes, grouped together and written to the database frequently. We evaluated ingest performance by generating X writes into DynamoDB in quick succession and recording the average time in milliseconds that it took for Rockset to sync that data and make it queryable, also known as data latency.
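
As a rough illustration of the data latency metric, the Python sketch below computes the per-record delay between a write and the moment the record first becomes queryable, then averages it. It is not the Node.js Lambda used for the tests; the function names and the timestamps are hypothetical, and the sample values are made up for the example rather than measured.

```python
from statistics import mean

def data_latencies_ms(write_times_ms: dict[str, int],
                      queryable_times_ms: dict[str, int]) -> dict[str, int]:
    """Per-record delay between the write and the record becoming queryable."""
    latencies = {}
    for key, written_at in write_times_ms.items():
        visible_at = queryable_times_ms.get(key)
        if visible_at is None:
            continue  # record not yet visible downstream; skip it
        latencies[key] = visible_at - written_at
    return latencies

def average_latency_ms(latencies: dict[str, int]) -> float:
    """Average data latency across all records that became queryable."""
    return mean(latencies.values())

# Illustrative timestamps only (milliseconds since the start of a test run).
writes = {"rec-a": 1000, "rec-b": 1010, "rec-c": 1020}
queryable = {"rec-a": 2900, "rec-b": 3110}  # rec-c has not synced yet

latencies = data_latencies_ms(writes, queryable)
print(latencies, average_latency_ms(latencies))
```

In a real test, the write timestamp would be taken when the item is written to DynamoDB and the queryable timestamp when a polling query first returns it.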

To run this performance test, we used a Rockset medium virtual instance with 8 vCPU of compute and 64 GiB of memory.

Streaming ingest performance on Rockset medium virtual instance with 8 vCPU and 64 GB RAM

The performance tests indicate that Rockset is capable of achieving a data latency under 2.4 seconds, which represents the duration between the generation of data in DynamoDB and its availability for querying in Rockset. This load testing made us confident that we could consistently access data approximately 2 seconds after writing to DynamoDB, giving users up-to-date data in their dashboards. In the past, we struggled to achieve predictable latency with Elasticsearch and were excited by the consistency that we saw with Rockset during load testing.

Query Performance

For query performance, we executed X queries randomly every 10-60 milliseconds. We ran two tests using queries with different levels of complexity:

  • Query 1: Simple query on a few fields of data. Dataset size of ~700K records and 2.5 GB.
  • Query 2: Complex query that expands arrays into several rows using an unnest function. Data is filtered on the unnested fields. Two datasets were joined together: one dataset had 700K rows and 2.5 GB, the other dataset had 650K rows and 3 GB.
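
The shape of Query 2 (unnest an array into rows, filter on the unnested fields, then join with a second dataset) can be sketched in plain Python. This is an in-memory illustration of the logic only, not the SQL that was run on Rockset; the helper names, field names and sample records are all hypothetical.

```python
from typing import Any, Iterable, Iterator

Row = dict[str, Any]

def unnest(rows: Iterable[Row], array_field: str, alias: str) -> Iterator[Row]:
    """Expand each element of an array field into its own output row."""
    for row in rows:
        for element in row.get(array_field, []):
            flat = {k: v for k, v in row.items() if k != array_field}
            flat[alias] = element
            yield flat

def hash_join(left: Iterable[Row], right: Iterable[Row], key: str) -> Iterator[Row]:
    """Inner join: index the right side by key, then probe with each left row."""
    index: dict[Any, list[Row]] = {}
    for right_row in right:
        index.setdefault(right_row[key], []).append(right_row)
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            yield {**right_row, **left_row}

# Hypothetical sample data standing in for the two joined datasets.
patients = [
    {"patient_id": "p1", "site_id": "s1",
     "events": [{"type": "fever", "severity": "high"},
                {"type": "nausea", "severity": "low"}]},
    {"patient_id": "p2", "site_id": "s2",
     "events": [{"type": "rash", "severity": "high"}]},
    {"patient_id": "p3", "site_id": "s1", "events": []},
]
sites = [{"site_id": "s1", "country": "US"}, {"site_id": "s2", "country": "DE"}]

flat_rows = unnest(patients, "events", "event")
high_severity = (r for r in flat_rows if r["event"]["severity"] == "high")
result = list(hash_join(high_severity, sites, "site_id"))
for row in result:
    print(row["patient_id"], row["country"], row["event"]["type"])
```

Filtering before the join keeps the probe side small, which is the usual reason to apply the unnested-field filter first.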

We again ran the tests on a Rockset medium virtual instance with 8 vCPU of compute and 64 GiB of memory.

Query performance of a simple query on a few fields of data. Query was run on a Rockset virtual instance with 8 vCPU and 64 GB RAM.

Query performance of a complex unnest query. Query was run on a Rockset virtual instance with 8 vCPU and 64 GB RAM.

Rockset was able to deliver query response times in the range of double-digit milliseconds, even when handling workloads with high levels of concurrency.

To determine if Rockset can scale linearly, we evaluated query performance on a small virtual instance, which had 4 vCPU of compute and 32 GiB of memory, against the medium virtual instance. The results showed that the medium virtual instance reduced query latency by a factor of 1.6x for the first query and 4.5x for the second query, suggesting that Rockset can scale efficiently for our workload.

We liked that Rockset achieved predictable query performance, clustered within 40% and 20% of the average, and that queries were consistently delivered in double-digit milliseconds; this fast query response time is essential to our user experience.
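
One way to quantify "clustered within a percentage of the average" is the largest relative deviation of any single query latency from the mean. The sketch below shows that calculation in Python; it is one possible reading of the metric, not necessarily how the figures above were computed, and the sample latencies are invented for illustration.

```python
from statistics import mean

def max_relative_deviation(latencies_ms: list[float]) -> float:
    """Largest distance of any sample from the mean, as a fraction of the mean."""
    average = mean(latencies_ms)
    return max(abs(sample - average) for sample in latencies_ms) / average

# Illustrative latencies in milliseconds (not measured values).
samples = [40, 50, 60, 50]
spread = max_relative_deviation(samples)
print(f"all samples within {spread:.0%} of the average")
```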


We’re currently phasing real-time clinical trial monitoring into production as the new operational data hub for clinical teams. We have been blown away by the speed of Rockset and its ability to support complex filters, joins, and aggregations. Rockset achieves double-digit millisecond latency queries and can scale ingest to support real-time updates, inserts and deletes from DynamoDB.

Unlike OpenSearch, which required manual interventions to achieve optimal performance, Rockset has proven to require minimal operational effort on our part. Scaling up our operations to accommodate larger virtual instances and more clinical sponsors happens with just a simple push of a button.

Over the next year, we’re excited to roll out the real-time study participant monitoring to all customers and continue our leadership in the digital transformation of clinical trials.

