Unstructured Information Management

by Robin Harris | Friday, December 8, 2006 | Clusters, Enterprise, Future Tech, Security & Public Policy | 2 comments

The redoubtable Kevin Closson has a post entitled “Introducing the ‘Unstructured Data Administrator’”. In it he refers to a study put out by the Independent Oracle Users Group called “Managing the Storage Equation: The Converging Roles of Data and Storage Professionals”. Very provocative title, which, sadly, the study doesn’t back up. But Kevin raises some very good questions.

Kevin also points to a Wikipedia article which quotes a Merrill Lynch study saying that 85% of all business data is unstructured. Clicking down the rabbit hole one comes to a website, sponsored by IBM, on something called UIMA (Unstructured Information Management Architecture).

OK, 15% of your data is managed by professionals, and the other 85% is ???
Several threads here.

My reading of the IOUG study is that DBAs have no idea what is happening with unstructured data, which doesn’t bode well for the putative convergence of data and storage professionals.
If only 15% of business data is in relational databases, and some presumably larger amount is in email, why is such a high percentage of business data kept on the most costly, high-performance storage?
IBM’s UIMA working group is co-led by DARPA, which, given its history with TIA, is probably putting the technology to work to spy on US citizens, in addition to its use in the MADCAT program for tactical foreign language document translation.

Another losing battle for IT?
A simple and persuasive narrative of the IT/LOB fight over the last 50 years is that IT loses every time. IT’s high-volume command-and-control mindset meets business unit needs for ad hoc tools, and after much techie mumbo-jumbo, IT is forced to concede that making money is more important than saving money.

If the combination of expensive DBAs, expensive storage and expensive DBMS rises onto the CFO’s radar, changes will occur. It will take some years, but the target will be too fat to ignore. How the dollar split will shake out over time is anyone’s guess. Oracle will be saying “we’ve automated management and use cheap storage, buy us!”; EMC will say “we can make any bloated pig of a mismanaged database scream, buy us!”; and DBAs will be saying “we optimize for the real world in real time, buy us!” They’ll all get whacked, and I’d guess storage the most and DBAs the least.

The StorageMojo take
Thanks, Kevin, for pointing out the growth and importance of unstructured data. It suggests more reasons for the coming hard landing in data storage.

Comments welcome, of course.

2 Comments

Kevin Closson on Friday, 8 December, 2006 at 7:35 pm

I had to pull out the Roget’s Thesaurus for your comment on my blog post about unstructured data! 🙂

The bottom line in that post, and other messaging on the blog, is the significant forking away from structured data that is taking place. Storage vendors are not making storage that is optimized for the typical RDBMS (at least not OLTP) I/O profile. I’m just trying to get Oracle shops to consider the message.

Thanks for your visit! And as I’ve said before, I love your blog.

Robert Pearson on Saturday, 9 December, 2006 at 2:53 am

I would like to share some information I have about “unstructured” Information. First, a quick Operational Definition.
I use “Information” to refer to information that can produce revenue, and “information” to refer to data and Information that are still combined, following Walter A. Shewhart’s data presentation rules:
1. Data has no meaning apart from its context.
2. Data contains both signal and noise. To be able to extract Information, one must separate the signal from the noise within the data.

Unstructured data has an unknown Information Context. The Information Content may be known but this is often meaningless out of the Context.
There are various ways of attempting to quickly determine Context. Ad hoc Information spaces seem to be the most popular.

How does this apply?
Quoting Ray Ozzie from an ACM interview at:
http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=349&page=3

“But IT really needs to leave it to the line of business to understand what the key collaborative processes are. The line of business really has to understand the role of structured processes versus ad hoc projects, and it has to assist IT in defining what tools are best to enable its structured processes and ad hoc processes. The line of business must understand the skills of the people and how best to match the tools with those skills.
…[snip]…
For example, if somebody did all of his or her work totally in an ad hoc, unstructured manner, there would be no artifacts that benefited the organization.”

This quote is from page 3 of a 5 page interview. I recommend reading the entire interview.

I have been a big fan of the concept of “Weaving the Pervasive Information Fabric using agents and multi-agents” since I first ran across it.
The closest thing to a home page is:
http://eprints.ecs.soton.ac.uk/3797/

The best write-up is the PDF at:
http://eprints.ecs.soton.ac.uk/3797/01/ohs6-weaving.pdf

A companion PDF is the “Hypermedia by coincidence” at:
http://eprints.ecs.soton.ac.uk/5901/01/ht01-coincidence.pdf

What does all this mean?
The “Pervasive Information Fabric” is required by the IoD (Information on Demand) infrastructure. The IoD is necessary for the timely creation and rapid dispersal, with Persistence, of transient “ad hoc” Information spaces. The “ad hoc” Information spaces extract Information from “unstructured” data to satisfy User requests. Think of it as JIT (Just In Time) Information.

To deal with the “Fundamental Shift in the Information Paradigm” we need some new concepts. The two key ones are Information Centric and Technology Centric. To manage these two concepts we need the “Lower Metric” definitions of the Unit of Information and the Unit of “Enabling” Technology. The Unit of Information is the key element and the Managed Units of Information are the “Profit Center” of the enterprise.
A Managed Unit of Information is simply a Unit of Information with an SLA (Service Level Agreement). By definition a Managed Unit of Information must be “Enabled” by a Managed Unit of Technology.

All the IDCs (Internet Data Centers) have to be doing this to survive.