
Video metadata: ripe for innovation


I learned a new word, "twerk." I ran a query on Google and saw a list of results and a film strip of images pointing to a pop singer's video. I clicked one, learned that twerk was a dance, and killed the video. Out of curiosity, I ran a query for twerk using the YouTube.com search system and saw a list of videos. I knew that Google offered a more comprehensive video search system, so I ran the query there as well. The difference between the two is that YouTube provides a finding service for videos on YouTube, while Google Video provides links to videos on YouTube and other video services. I scanned the results lists and made, for me, some new discoveries.

The YouTube.com search function provides ads, thumbnails and very short snippets of text about the video. The text comes from information provided by the person who uploaded the video to YouTube. The YouTube top hit in the results list contained useful information; for example, the performer explained the dance. Although Google Video search added links to other services, those links added no value.

Also, the Google Video results were a jumble of instruction, teens fooling around and commercial promotions. The content in the results list was not sortable. The list was a 1994 Lycos-style collection of links. After a decade of search innovation, the rich media search system from the world's largest search vendor was stuck in time.

What's more, only sketchy basic information was available. Each video thumbnail had the running time in its lower right corner. Whether the video was an amateur or commercial production, or whether it was suitable for children, was not presented. Metadata that would be useful to both a consumer and professional researcher was not to be found.

Metadata for textual information is important. For video, metadata is extremely important. Yet metadata was inadequate for both of Google's public video services.

Watch the images flow

When I learned about Gartner's video management report, I assumed that the issue of video metadata would be dominant. I had learned from SAP's "99 Facts on the Future of Business" (See slideshare.net/sap/99-facts-on-the-future-of-business) that video is the future of information. The list of 99 facts includes this: Ninety percent of all Internet traffic in 2017 will be video.

Consultants are knee deep in video. The September 2013 Gartner Magic Quadrant Report for Enterprise Video Management mentions a number of forward-pointing firms, but the topic of metadata is not a focal point of the analyses of Brightcove, Cisco, Ignite, Kaltura, Kontiki, Polycom, Qumu and the other firms referenced.

My impression is that video metadata boils down to file date and time stamps, user-provided information such as a title, and possibly closed captions if available. Analysis of the images and processing of the audio track are not widely available functions. Generating metadata for video, as a result, resembles the challenge astrophysicists face when analyzing a black hole. "Something" is there, but getting details is difficult if not impossible with today's technology. To figure out what is in a video, a person has to watch the images flow in serial fashion. Boring, difficult and expensive: unappetizing words to use when looking for information in non-text indexes. Yet video is a content type of increasing importance in the enterprise.


I assume that the volume of video content varies from company to company. Hewlett-Packard's internal analysts reported in 2013 that the average stored data per company was 14.6 petabytes. Google reports that more than 100 hours of video are uploaded to YouTube every minute. The company also invests in video indexing. It is safe to conclude that video content is a big deal. Video metatagging has to improve; otherwise, primitive indexing methods will make potentially important information unfindable. Who can watch videos as quickly as a person can scan text?
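The scale problem can be made concrete with a little arithmetic. The sketch below takes only the upload figure quoted above (100 hours per minute) and works out how many people would have to watch around the clock to keep pace. The function names and the playback-speed parameter are illustrative assumptions, not figures from Google.

```python
# Back-of-the-envelope arithmetic on the upload rate quoted in the text.
UPLOAD_HOURS_PER_MINUTE = 100  # "more than 100 hours of video ... every minute"


def viewers_needed(upload_hours_per_minute, playback_speed=1.0):
    """Viewers who must watch nonstop to keep up with new uploads.

    playback_speed is a hypothetical knob: 2.0 means watching at double speed.
    """
    minutes_of_video_per_minute = upload_hours_per_minute * 60
    return minutes_of_video_per_minute / playback_speed


def upload_hours_per_day(upload_hours_per_minute):
    """Hours of new video arriving in one 24-hour day."""
    return upload_hours_per_minute * 60 * 24


print(viewers_needed(UPLOAD_HOURS_PER_MINUTE))        # 6000.0
print(upload_hours_per_day(UPLOAD_HOURS_PER_MINUTE))  # 144000
```

Even at double speed, 3,000 people watching without a break would be needed just to view the new material, before anyone writes a single tag.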

In June 2013, The Guardian, a newspaper published in the United Kingdom, posted a page called "A Guardian Guide to Your Metadata." The page provides background for Guardian readers who follow the summaries and analyses of the allegedly confidential documents obtained by Edward Snowden.

The Guardian explains: "Metadata is information generated as you use technology, and its use has been the subject of controversy since NSA's secret surveillance program was revealed. Examples include the date and time you called somebody or the location from which you last accessed your e-mail. The data collected generally does not contain personal or content-specific details, but rather transactional information about the user, the device and activities taking place. In some cases, you can limit the information that is collected—by turning off location services on your cell phone for instance—but many times you cannot." (See theguardian.com/technology/interactive/2013/jun/12/what-is-metadata-nsa-surveillance#meta=0000000.)

The newspaper squeezes down the notion of metadata to the information included with a Twitter message or "tweet." There is also an example of metadata associated with e-mail. The example shows how an e-mail from one person can associate the sender with the recipient and make it possible to cross-reference the entities in the e-mail chain.

The Guardian's explanation troubles me because it implies that e-mail and tweets are the primary message types. Based on my experience, The Guardian itself ignores the problems presented by generating metadata for rich media such as software, images, audio files and video. When multiple data types must be processed, even log file information can pose a significant problem. Metadata, in short, is presented as a hassle-free way to get information about information.


The focus of metadata aficionados is on text. The bias is quite strong. In July 2013, I spoke with Avi Meyers, one of the principals of Text Analysis International (TAI). The firm, whose product is VisualText, provides a toolkit that developers can use to perform information extraction, categorization, text mining and other sophisticated operations.

The technology embodied in VisualText originated with Amnon Meyers, the chief technology officer, co-founder of the company and Avi's brother. Avi Meyers told me, "Amnon's career is marked by successive advances in a pragmatic line of NLP research, emphasizing toolkits for NLP."

He explained the company's sophisticated technology in this way: "So who needs a text analyzer? Anyone who needs to process documents, reports, Web pages, e-mail, chats and any other communication. Voice processing is a very hot area, but once the voice is converted to text, something needs to process that text. We are all inundated with textual information. TAI's VisualText technology leads the way to more accurate and complete solutions to this universal challenge. The major application areas are Web, business, law, medicine, military and government. VisualText is ideal for text analysis applications to combat terrorism, narcotics, espionage and nuclear proliferation. Text from speech, e-mail and chat dialogues are other large application areas."

Rich media challenge

Still, video cannot be ignored. The reasons for the existing text bias are easy to enumerate. First, rich media content requires computationally expensive processes like speech-to-text conversion. The text is then processed by the content analysis subsystem. The metadata from the speech-to-text process can then be merged with existing metadata about the rich media object. Dassault Exalead (3ds.com/prod), among other vendors, offers technology that can do that work. Cost and time are barriers. A company without the almost limitless resources of Google does not generate metadata with that approach. Not even Google has the appetite to dig into the "words" within video.
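The last two steps of that workflow, content analysis and the merge, can be sketched in a few lines. The example below assumes the expensive part, speech-to-text, has already produced a transcript string. The keyword counter is a deliberately crude stand-in for a real content analysis subsystem, and every function and field name is hypothetical rather than taken from Exalead or any other vendor.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a far larger one.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "are", "was", "you"}


def extract_keywords(transcript, top_n=5):
    """Crude content analysis: the most frequent non-stopword terms."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def merge_metadata(existing, transcript, top_n=5):
    """Return a copy of the existing record enriched with transcript metadata."""
    merged = dict(existing)  # leave the caller's record untouched
    merged["transcript_keywords"] = extract_keywords(transcript, top_n)
    merged["transcript_word_count"] = len(transcript.split())
    return merged


existing = {"title": "Safety training", "duration_seconds": 1800}
transcript = (
    "Forklift safety matters. Inspect the forklift before every shift. "
    "Forklift operators need training."
)
print(merge_metadata(existing, transcript))
```

The merge itself is trivial. The cost sits upstream, in producing a transcript accurate enough to be worth indexing.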

What about the images within the video? For some applications, recognizing the individuals in a video may be more important than what the audio track captures. On-the-fly image recognition is more computationally expensive than speech-to-text technology. I saw a demonstration of a system that could "watch" a soccer match. When a goal was scored, the system flagged the "event" so that an editor could jump to the point in the file at which the scoring took place. For video editing, that feature is important. How many enterprise knowledge management systems process the audio in a training session and tag the most important question asked during the question-and-answer session? I have not identified a system capable of that type of metatagging.
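Event flagging of that kind amounts to attaching a label to a time offset. The sketch below assumes a speech-to-text step has already produced timed transcript segments, and it flags the segments that look like questions so a viewer can jump straight to them. Deciding which question is the "most important" is beyond it, and the class names, word list and heuristic are all illustrative assumptions rather than a description of any shipping product.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_seconds: float  # offset of the segment from the start of the video
    text: str


@dataclass
class EventMarker:
    start_seconds: float
    label: str
    text: str


# Speech-to-text output often lacks punctuation, so opening words are checked too.
# This is a rough heuristic and will produce some false positives.
QUESTION_OPENERS = ("who", "what", "when", "where", "why", "how", "can", "could")


def flag_questions(segments):
    """Return a marker for every segment that looks like a question."""
    markers = []
    for seg in segments:
        text = seg.text.strip()
        first_word = text.split(" ", 1)[0].lower() if text else ""
        if text.endswith("?") or first_word in QUESTION_OPENERS:
            markers.append(EventMarker(seg.start_seconds, "question", text))
    return markers


def format_timestamp(seconds):
    """Render an offset in seconds as HH:MM:SS for a jump-to link."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


segments = [
    Segment(0.0, "Welcome to the training session."),
    Segment(2710.0, "How do we report a near miss?"),
    Segment(2755.5, "Use the incident form."),
]
for marker in flag_questions(segments):
    print(format_timestamp(marker.start_seconds), marker.label, marker.text)
```

The same marker structure would serve for a goal in a soccer match. Only the detector that produces the marker changes, and that detector is where the computational expense lies.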
