Quantity Does Matter: Records Management for Billions of Documents

In an age of increasing litigation and compliance regulation, being a packrat finally pays off. Hoarding files may protect corporate officers from large fines and jail terms. Yet it is not enough just to retain documents: they must be tracked and disposed of properly at the correct time, and the entire process must be controlled and audited. Combine this with the ever-increasing quantity and complexity of electronic business records, and it should be no surprise that the proper management and disposition of these records can be a complex and time-consuming task.

The volume of electronic documents continues to climb, with no end in sight, for several reasons:

Companies are more aware of potential liabilities associated with not retaining records;
The average retention period for records continues to get longer; and
New regulations have significant record-keeping requirements.

Challenges

Traditional repositories and methods for dealing with records don't work well with large volumes of records. Records should be categorized as they are stored, but relying on the end user to properly categorize every email or document as it is created or sent will likely result in end-user frustration and improper categorization. An end user might, for example, put everything into a miscellaneous category.

Third-party relational database management systems typically don't perform well when individual database tables (which store a record's metadata and pointers to the actual file location) exceed a hundred million rows. In addition, handling billions of small files on traditional RAID storage and file structures will be inefficient or technically impossible, and will likely cause unacceptable performance degradation in management processes such as backup and recovery.

Most records managers want the ability to approve records set for deletion within a given time period, typically the next 30 days. It is not possible for a full-time records manager to make individual approval decisions on millions of records every day.

Solutions

Records should be auto-categorized as they are stored, with the option for the end user to override the default categorization if needed. Auto-categorization could include category defaults for documents created and sent, based on a user's role within the company, and/or software that determines the category based on keywords within the text of the document.
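The auto-categorization idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the role names, keywords, category labels, and the precedence order (user override, then keyword match, then role default) are all hypothetical and are not taken from any particular product.

```python
import re
from dataclasses import dataclass

# Hypothetical role defaults and keyword rules; a real deployment would
# derive these from the organization's own retention schedule.
ROLE_DEFAULTS = {"accountant": "financial", "recruiter": "personnel"}
KEYWORD_RULES = {"invoice": "financial", "contract": "legal", "resume": "personnel"}
FALLBACK_CATEGORY = "general"


@dataclass
class Categorization:
    category: str
    source: str  # "user", "keyword", "role", or "fallback"


def categorize(text, role, user_override=None):
    """Pick a category: explicit user choice, then keywords, then role default."""
    if user_override:
        return Categorization(user_override, "user")
    words = set(re.findall(r"[a-z]+", text.lower()))
    for keyword, category in KEYWORD_RULES.items():
        if keyword in words:
            return Categorization(category, "keyword")
    if role in ROLE_DEFAULTS:
        return Categorization(ROLE_DEFAULTS[role], "role")
    return Categorization(FALLBACK_CATEGORY, "fallback")


if __name__ == "__main__":
    print(categorize("Please find the attached invoice.", "recruiter"))
    print(categorize("Lunch on Friday?", "accountant"))
```

Recording the source of each decision alongside the category is one way to keep the process auditable, since a reviewer can later distinguish user choices from defaults.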

Database tables should be sized by segmenting document metadata and file pointer information according to specific document criteria. Options include segmenting by document creation or storage date, hashing a specific document index value, or creating a new database table when the one currently being updated reaches a certain maximum number of rows.
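The three segmentation strategies just described might be sketched as follows. The table-name prefix, bucket count, and row limit are hypothetical values chosen for illustration, and the code only computes the name of the table a row would be routed to; it does not talk to a database.

```python
import hashlib
from datetime import date


def table_for_date(stored_on, prefix="doc_meta"):
    """Segment by storage date: one table per calendar month."""
    return f"{prefix}_{stored_on.year}_{stored_on.month:02d}"


def table_for_hash(index_value, buckets=64, prefix="doc_meta"):
    """Segment by hashing an index value into a fixed number of tables."""
    digest = hashlib.sha256(index_value.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % buckets
    return f"{prefix}_h{bucket:03d}"


class RolloverAllocator:
    """Segment by size: start a new table once the current one is full."""

    def __init__(self, max_rows, prefix="doc_meta"):
        self.max_rows = max_rows
        self.prefix = prefix
        self.segment = 0
        self.rows_in_segment = 0

    def next_table(self):
        if self.rows_in_segment >= self.max_rows:
            self.segment += 1
            self.rows_in_segment = 0
        self.rows_in_segment += 1
        return f"{self.prefix}_s{self.segment:05d}"


if __name__ == "__main__":
    print(table_for_date(date(2024, 3, 15)))
    print(table_for_hash("ACCT-1001"))
    allocator = RolloverAllocator(max_rows=2)
    print([allocator.next_table() for _ in range(3)])
```

Date-based segments tend to line up with retention periods, so an expired segment can be dropped as a unit, while hashing spreads load evenly but scatters documents of the same age across tables.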

There are two options for dealing with back-end storage of large numbers of small files. One is to use content addressable storage (CAS), but this option can be expensive (particularly since backups will likely require a duplicate mirrored disk configuration at an off-site location). It also requires that disk I/O (read/write/delete) be performed from within an application that supports a particular CAS vendor, typically by integrating the vendor's proprietary APIs for disk access. Another concern when choosing a specific CAS architecture is the likelihood of being tied to a particular vendor's proprietary technology for the life of the longest document retention period. The second option is to append the many individual small files together into a single file and use the aggregate record and cloning architecture.
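As a rough illustration of the second option, the sketch below appends small documents to a single container and keeps an index of byte offsets so that each one can still be read back individually. The class name and the offset/length index are assumptions made for this example; the article does not specify how the appended file is laid out.

```python
import io


class AggregateFile:
    """Append-only container; the index maps a document id to (offset, length)."""

    def __init__(self, stream):
        self.stream = stream  # any seekable binary stream, e.g. an open file
        self.index = {}

    def append(self, doc_id, payload):
        """Write the payload at the end of the container and record where it is."""
        offset = self.stream.seek(0, io.SEEK_END)
        self.stream.write(payload)
        self.index[doc_id] = (offset, len(payload))

    def read(self, doc_id):
        """Return one document's bytes using its recorded offset and length."""
        offset, length = self.index[doc_id]
        self.stream.seek(offset)
        return self.stream.read(length)


if __name__ == "__main__":
    container = AggregateFile(io.BytesIO())
    container.append("stmt-001", b"Statement for customer 1")
    container.append("stmt-002", b"Statement for customer 2")
    print(container.read("stmt-002"))
```

With this layout, backup and recovery operate on one large file rather than on thousands of small ones, which is the inefficiency the approach is meant to avoid.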

Aggregate Record and Cloning Architecture

To make the disposition approval process workable against an extremely large number of records, it is often useful to store many individual records with similar retention and disposition criteria as a single aggregate record initially. If the disposition of particular documents later changes, each designation request should result in the affected documents being cloned into a new individual record. Index references to the original copies of the cloned documents are then removed, rendering the original copies inaccessible within the archive.

Record cloning allows a group of documents to be treated initially as one record, yet retains the flexibility to designate different retention criteria as needed. A typical cloning scenario could involve a financial company that prints and electronically stores a file every month containing a statement for each of its customers. The entire statement file could initially be treated as one aggregate record. If a particular statement were needed for litigation purposes, it could be cloned into its own record, and its retention could be extended or set to infinite until the case is resolved.
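The statement-file scenario can be modeled with a small in-memory sketch. Everything here (the Record and Archive names, the use of None to mean indefinite retention, the sample identifiers and dates) is hypothetical and meant only to show the mechanics of cloning one document out of an aggregate record and repointing the index.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class Record:
    record_id: str
    dispose_after: Optional[date]  # None means retain indefinitely
    documents: Dict[str, bytes] = field(default_factory=dict)


class Archive:
    def __init__(self):
        self.records = {}  # record_id -> Record
        self.index = {}    # doc_id -> record_id that currently serves it

    def store_aggregate(self, record_id, documents, dispose_after):
        """Store many documents as one record with shared disposition criteria."""
        self.records[record_id] = Record(record_id, dispose_after, dict(documents))
        for doc_id in documents:
            self.index[doc_id] = record_id

    def clone_document(self, doc_id, new_record_id, dispose_after):
        """Copy one document into its own record and repoint the index to it."""
        source = self.records[self.index[doc_id]]
        clone = Record(new_record_id, dispose_after, {doc_id: source.documents[doc_id]})
        self.records[new_record_id] = clone
        self.index[doc_id] = new_record_id  # original copy is no longer referenced
        return clone

    def fetch(self, doc_id):
        return self.records[self.index[doc_id]].documents[doc_id]

    def records_due(self, today):
        """Records a manager would be asked to approve for disposal."""
        return [r.record_id for r in self.records.values()
                if r.dispose_after is not None and r.dispose_after <= today]


if __name__ == "__main__":
    archive = Archive()
    statements = {f"cust-{n}": f"statement {n}".encode() for n in range(1, 4)}
    archive.store_aggregate("stmt-2024-01", statements, date(2031, 1, 31))
    archive.clone_document("cust-2", "hold-cust-2", dispose_after=None)
    print(archive.records_due(date(2031, 2, 1)))
    print(archive.index["cust-2"], archive.fetch("cust-2"))
```

In this model the records manager approves a single aggregate record for disposal rather than one decision per statement, while the cloned statement stays out of the disposal list because its retention is open-ended.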

For almost 40 years, docHarbor has been focused on capturing and preserving documents for thousands of organizations, including most of the Fortune 100. docHarbor's hosted suite of document management solutions has a proven track record for rapid implementation, predictable price and performance, and powerful return on investment. www.docharbor.com
