It’s Time for Real-Time Analytics

Particularly in highly regulated industries such as financial services and healthcare, there is strong demand for real-time analytics over both structured and unstructured data. For clarity, we will use “structured data” to mean text data that can be parsed into identifiable components, such as relational records, business objects, and XML data structures. By “unstructured data” we mean any binary object, such as an image. Raw text without a clear organization sits, interestingly enough, somewhere between the two, as we will see later.

There are many scenarios where real-time analytics becomes extremely useful, including:

  • Real-time fraud detection for financial transactions
  • Data leak prevention
  • Data protection decisions
  • Business process triggers, such as validating financial transactions over 10,000 USD

Technology Drivers Impacting Real-Time Analytics

Apart from business drivers, there are also technology drivers for real-time analytics. A surprising example is metadata enrichment of “to-be-archived” data.

For example, legislation like Dodd-Frank demands the storage of all communications related to financial transactions, including email, voice, trading records, IM data, and the like. The volume of this data is high, and it is typically archived in read-only archives leveraging WORM technologies. There are several challenges associated with this approach, such as:

  • Metadata enrichment of this data needs to be accurate for that point in time. If you ingest data first and then enrich it, you might get historically incorrect data, which can lead to fines.
  • High-volume ingestion requires a careful design of the system, where real-time decisions about the data significantly improve performance by avoiding repartitioning, re-indexing and the like.
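The enrich-at-ingest idea can be sketched in a few lines of Python. This is an illustrative example, not production code: the function and field names (`enrich_at_ingest`, `counterparty_id`, and the reference-data lookup) are hypothetical, standing in for whatever point-in-time reference data a real archive would attach.

```python
from datetime import datetime, timezone

def enrich_at_ingest(record, reference_data):
    """Attach point-in-time metadata before the record is archived.

    Enriching at ingest captures reference data as it is at archive
    time; enriching later risks historically incorrect values (e.g.,
    a counterparty renamed after the transaction occurred).
    """
    enriched = dict(record)
    enriched["archived_at"] = datetime.now(timezone.utc).isoformat()
    # Hypothetical lookup: resolve the counterparty name valid *now*.
    enriched["counterparty_name"] = reference_data.get(
        record["counterparty_id"], "UNKNOWN")
    return enriched

reference = {"C-42": "Acme Trading LLC"}
rec = {"tx_id": "T-1", "counterparty_id": "C-42", "amount": 2500}
print(enrich_at_ingest(rec, reference)["counterparty_name"])
```

Because the enrichment happens in the ingestion stream, the archived record is self-describing and never needs a retroactive (and potentially historically wrong) update.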

The challenge with real-time analytics is, of course, the real-time requirement itself: processing must keep pace with the data as it arrives. Real-time processing of structured data is typically efficient but requires significant software infrastructure. The good news, however, is that there are now excellent open source real-time analytics frameworks available that make real-time analytics more affordable.

Take Apache Storm, for example. It is a distributed real-time computation system that operates on data streams, letting developers build specific processing steps that run on the data as it flows through. Storm is fault-tolerant and scales out horizontally. Technologies like Apache Storm reduce the overall cost of ownership of real-time processing and offer the flexibility to change the processing model over time (e.g., when legislation changes).

A good example, in the context of archiving, is a partition decision. Assume you archive bank transactions, and, according to the Bank Secrecy Act, you need to implement the “10,000 USD rule” where cash transactions over 10,000 USD need to be reported and have client identity verification. To archive this transaction, the record might need to be enriched with the identity data. It might also be practical to partition these records into a separate archive partition for reporting or retention purposes.

Implementing the above example for structured data is straightforward with a framework like Apache Storm, and although you could write proprietary code that does the same, the modularity of Apache Storm makes it much easier to build and change. Transactions over 10,000 USD might, for example, kick off several validation processes before the transaction is approved.
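The partition decision described above can be sketched as a single routing step. This is plain Python rather than Storm's actual Java topology API, and the names (`route_transaction`, `verify_identity`) are illustrative placeholders; in a real deployment the routing logic would live inside a Storm bolt and the identity check would call out to a KYC service.

```python
# Bank Secrecy Act "10,000 USD rule" threshold for cash transactions.
THRESHOLD_USD = 10_000

def verify_identity(client_id):
    # Placeholder for a real identity-verification (KYC) lookup.
    return client_id is not None

def route_transaction(tx):
    """Decide, in the stream, which archive partition a record belongs to.

    Cash transactions over the threshold are enriched with an
    identity-verification flag and routed to a separate, reportable
    partition; everything else goes to the standard archive.
    """
    if tx["type"] == "cash" and tx["amount_usd"] > THRESHOLD_USD:
        enriched = dict(tx, identity_verified=verify_identity(tx["client_id"]))
        return "reportable", enriched
    return "standard", tx

partition, record = route_transaction(
    {"amount_usd": 15_000, "type": "cash", "client_id": "C-7"})
print(partition)  # reportable
```

Because the decision is made in-stream, the reportable records land in their own partition at write time, avoiding the repartitioning and re-indexing costs mentioned earlier.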

New Developments Impacting Real-Time Data Analytics

But what about processing raw text and unstructured data?

There are several developments that make processing of raw text and unstructured data in real-time more compelling and affordable. As previously mentioned, open-source technologies that can process a wide variety of file formats and languages are mature enough to be deployed in large-scale real-time processing systems.

Perhaps a more revolutionary change in text processing is the ability to vectorize text using relatively new algorithms.

Vectorization of text content opens up a wide variety of possibilities, such as the ability to discover similarities in text. This is very useful for fraud detection in a specific domain such as banking, for example, where a domain-specific training set is fed to a machine learning algorithm, leading to more accurate fraud detection in text content.

Real-time processing of text is interesting because vector calculations are fast and the approach reduces the need for deeper natural language processing: the system doesn't need to understand the underlying grammar of the text; it merely needs to recognize word and/or sentence boundaries, depending on the unit of text you want to represent as a vector.

Catching Malicious Acts in Real-Time

Almost all corporate communication is monitored these days, and we now have the ability to detect suspicious phone calls (via speech-to-text), instant messages, emails, and the like, all in real time. We can act before a malicious trader finishes a transaction.

Fishy traders will need to adopt sign language, although it won’t be long before brain-computer interfaces will catch that, too. The vision of spotting crime before it happens, as illustrated in the movie “Minority Report,” might soon become reality.

It’s Time

From business drivers in regulated industries, to technology enablers across all industries, it’s clear that the time to have access to analytics is now—in real-time.
