

Automating Data Collection: Reaching Beyond the Firewall

A recent IDC study found that information workers typically spend 17.8 hours per week, at an estimated cost of $26,700 per worker annually, simply searching for and gathering information. A typical large company may spend $5.7 million annually cutting and pasting information from one source to another.

With the shift toward an information economy, globalization and intensifying competition, automating data collection to support your business intelligence initiatives has never been more important than it is today. Benefits include:

  • Substantial cost savings;
  • Rapid data acquisition and distribution;
  • Real-time competitive information; and
  • Populating data warehouses.

Fortunately, the technology to automate your data collection already exists. Intelligent agents are software programs capable of navigating complex data structures, extracting target data with a high degree of precision, then repurposing that data into an actionable format.

Intelligent Agents
Intelligent agents are a proven and robust technology for navigating unstructured data sources, negotiating security protocols, duplicating session variables, submitting forms and extracting gigabytes of data from the deepest crevices of the Web or other data networks.

What makes an agent intelligent? An intelligent agent is autonomous software: proactive, goal-oriented, responsive to its environment and capable of coordinating with other agents. The most adept agents can:

  • Programmatically navigate complex data networks;
  • Read a variety of popular file formats;
  • Identify and extract targeted data; 
  • Repurpose the data into an actionable format;
  • Conserve the use of external resources; and
  • Defend themselves against counterintelligence.
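Two of the capabilities above, identifying and extracting targeted data and repurposing it into an actionable format, can be illustrated with a minimal Python sketch. It is not taken from any particular agent product. The sample page, the `TableExtractor` class and the helper names are hypothetical, and a real agent would also handle navigation, sessions and malformed markup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a fetched Web page.
SAMPLE_HTML = """
<html><body>
<h1>Daily prices</h1>
<table id="prices">
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Crude oil</td><td>71.25</td></tr>
  <tr><td>Natural gas</td><td>2.84</td></tr>
</table>
</body></html>
"""


class TableExtractor(HTMLParser):
    """Collects the cell text of every table row found in the page."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._row[-1] = self._row[-1].strip()
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text outside table cells (headings, whitespace) is ignored.
        if self._in_cell:
            self._row[-1] += data


def extract_rows(html):
    """Extract: pull the targeted table rows out of the markup."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Repurpose: convert the rows to CSV text for a spreadsheet or loader."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
csv_text = to_csv(rows)
print(csv_text)
```

The extraction and repurposing steps are kept in separate functions so that the output format can change, for example to database inserts, without touching the parsing logic.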

An energy trading company provides a good example of how intelligent agents are used. High-volume trading of crude oil and refined products, natural gas and power requires monitoring hundreds of information sources: subscription services, internal data repositories and Websites posting production data. Some of these sources may need to be queried weekly, others in real time. The extracted data must then be structured and imported into a database or data warehouse where it can power business intelligence applications.

In the business of energy trading, a few seconds are a lifetime. Numerous factors affect the price of energy—weather, water flows, transmission capacities, etc. Data must be continually collected and integrated in order to make split-second decisions.
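The pattern in the energy trading example, sources polled at very different intervals with results loaded into a database, might be sketched as follows. This is an illustration under stated assumptions, not a description of any trading firm's system. The source names, intervals, table layout and placeholder values are all hypothetical, and a simulated clock replaces real waiting so the sketch runs instantly.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    interval_seconds: int  # how often the source should be queried


# Hypothetical sources and intervals, chosen only for illustration.
SOURCES = [
    Source("weekly_production_report", 7 * 24 * 3600),
    Source("spot_price_feed", 5),
]


def due_sources(sources, last_polled, now):
    """Return the sources whose polling interval has elapsed at time `now`."""
    due = []
    for source in sources:
        last = last_polled.get(source.name)
        if last is None or now - last >= source.interval_seconds:
            due.append(source)
    return due


def store_reading(conn, source_name, value, collected_at):
    """Load one structured reading into the database."""
    conn.execute(
        "INSERT INTO readings (source, value, collected_at) VALUES (?, ?, ?)",
        (source_name, value, collected_at),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (source TEXT, value REAL, collected_at REAL)")

last_polled = {}
# Simulated clock ticks, in seconds, so the sketch does not have to sleep.
for now in (0, 3, 5, 10):
    for source in due_sources(SOURCES, last_polled, now):
        # A real agent would fetch and extract here; 0.0 is a placeholder value.
        store_reading(conn, source.name, 0.0, now)
        last_polled[source.name] = now
conn.commit()

counts = dict(conn.execute("SELECT source, COUNT(*) FROM readings GROUP BY source"))
print(counts)
```

Over the four simulated ticks the weekly source is queried once and the five-second feed three times, which shows how one scheduler can serve both kinds of source.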

A new generation of intelligent agents is being built using integrated development environments (IDEs). These IDEs incorporate best practices along with network monitoring, debugging and deployment tools. Two separate approaches have evolved to meet market demand: point-and-click and programmatic.

Point-and-click. Point-and-click IDEs typically consist of wizards that guide the user through a linear progression of choices with successive options dependent upon previous choices. The idea is that non-programming staff can create effective agents at less expense than IT staff. It's an attractive argument, at least for simple applications that don't require intricate navigation and precise data targeting.

The user's guide for one point-and-click IDE package recommends a "basic understanding" of programming constructs, debugging, HTML/XML, scripting languages such as JavaScript and VBScript, and regular expressions. That is hardly the skill set of most administrative staff.

"Basic" may be disingenuous given the complexity of navigating and identifying data on the Web, de-constructing conditional logic in client-side script or understanding changes in a Web page that have invalidated a robot's programming. In reality, point-and-click software often disguises the complexity of the problem rather than simplifying the solution. While it's true that administrative staff with no programming experience may be able to build rudimentary agents with simple missions, some Web programming skills are required for anything more complex.

Programmatic. An alternative to point-and-click development leverages SQL (structured query language) syntax to query unstructured data sources. SQL is a powerful declarative programming language with a concise syntax. It has the additional advantage of a broad user base, which lowers staffing costs and eases availability issues.

Employing an SQL-like syntax allows the developer to query a variety of unstructured data sources using nested queries and views, restrict data sets using "where" and "having" clauses, test against "case" and "if" statements, group and order results, iterate using arrays, manipulate values with string and numeric functions, and match data against patterns with regular expressions.
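To give a feel for this style of query, the sketch below uses standard SQL in Python's built-in `sqlite3` module rather than any vendor's agent language, so it illustrates the general idea and not a specific product's syntax. The `listings` table, its sample rows and the `regexp` helper are hypothetical. SQLite has no built-in REGEXP implementation, so the helper is registered as a user function.

```python
import re
import sqlite3

# Hypothetical rows standing in for data an agent has already extracted.
EXTRACTED = [
    ("site-a", "Crude oil", "USD 71.25"),
    ("site-b", "Crude oil", "USD 70.75"),
    ("site-a", "Natural gas", "USD 2.84"),
    ("site-b", "Natural gas", "price unavailable"),
    ("site-c", "Power", "USD 45.10"),
]


def regexp(pattern, text):
    """Backs SQLite's `text REGEXP pattern` operator with Python's re module."""
    return 1 if text is not None and re.search(pattern, text) else 0


conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)
conn.execute("CREATE TABLE listings (site TEXT, product TEXT, raw_price TEXT)")
conn.executemany("INSERT INTO listings VALUES (?, ?, ?)", EXTRACTED)

# WHERE keeps only well-formed prices, GROUP BY and HAVING keep products
# quoted by at least two sites, and ORDER BY sorts the result.
QUERY = r"""
SELECT product,
       COUNT(*) AS quotes,
       AVG(CAST(substr(raw_price, 5) AS REAL)) AS avg_price
FROM listings
WHERE raw_price REGEXP '^USD [0-9]+\.[0-9]{2}$'
GROUP BY product
HAVING COUNT(*) >= 2
ORDER BY product
"""
results = conn.execute(QUERY).fetchall()
print(results)
```

The query states what data is wanted rather than how to loop over it, which is the declarative quality that makes SQL-style syntax approachable for analysts who already know the language.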

In a report entitled Commercial Information Technology Possibilities written for the Center for Technology and National Security Policy, the author states: "Use of SQL reduces the learning curve for most analysts by not requiring mastery of computer languages that support artificial intelligence-based approaches to symbol manipulation. The strategy is to pull from a variety of source documents that are hosted on the Internet or maintained by subscription services or internal network servers. This data, once located, pinpointed and extracted is then repurposed by converting and storing it in a format of use to the client, such as a spreadsheet or graphic."

Point-and-click development makes sense in situations that don't require the power and flexibility of a declarative language similar to SQL or when programming staff are simply unavailable.

The ROI of your business intelligence projects can improve significantly if you find cost-effective ways to collect competitive information from outside your firewall. For many of these data collection tasks, intelligent software agents created with robust software tools should be given serious consideration over manual or traditional programming methods.


QL2 Software provides solutions for Web mining and unstructured data management to more than 30 Fortune 1000 companies. For more information about QL2 and its award-winning software tool, WebQL, visit www.ql2.com
