Joe Shaumbaugh

Balancing IT Needs For Effective Drug Discovery

Programming in script-based languages such as Perl and Tcl is popular in life science disciplines such as bio- and cheminformatics, especially among developers at pharmaceutical and biotech companies. Adding the ability to use XML to pipe these scripts into more robust systems used in many bioinformatics computing environments lowers the entry hurdle for those with little or no programming experience in CORBA, Java, C, and C++. Although XML is no replacement for CORBA, adding XML can bring significant computing benefits along with huge time and cost savings if applied properly.

Often, programming tools that offer the greatest flexibility and ease of use aren't the ones you'd use to deploy solutions to large numbers of people. Typically, tools are chosen to complete specific tasks, with some of the unseen implications given only secondary consideration. In the area of life sciences research, for example, programmers piece together solutions using a variety of different programming languages, scripting tools, and data formats. Perl is a favorite tool of many bioinformaticists due to its parsing capabilities and because most common forms of biological data exist in easy-to-read ASCII formats. Since Perl and HTML work well together, the results are typically distributed via Web interfaces to scientists who depend on the data. While this scenario works well for individual programs and result sets, it becomes a problem when integration across data and tools is the goal.
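Part of the appeal is how little code such parsing takes. As a minimal sketch (in Python, another popular scripting choice in this field; the file name is hypothetical), here is a reader for FASTA, one of the common ASCII sequence formats:

    # Minimal FASTA reader: yields (header, sequence) pairs.
    def read_fasta(path):
        header, chunks = None, []
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):          # a new record begins
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line[1:], []
                elif line:
                    chunks.append(line)
        if header is not None:                    # emit the final record
            yield header, "".join(chunks)

    for header, seq in read_fasta("sequences.fasta"):
        print(header, len(seq))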

On the other hand, technologies such as CORBA provide a means for creating robust, scalable solutions that provide interoperability in a secure, stable, runtime environment composed of dissimilar computing platforms, servers, and client applications. This is the key reason for its appeal in the drug discovery business, where technology advancements have enabled the generation of an ever-increasing amount of data from all kinds of sources. The success of pharmaceutical companies today is tied to how effectively they can analyze, integrate, and share this data among research scientists. This is especially true lately, as the pharmaceutical industry continues to consolidate. Mergers among large pharmaceutical companies result in the blending of research groups that approach the generation, manipulation, and storage of data differently.

CORBA has become a standard for building enterprise-scale software solutions in many industries. The Life Sciences Research Domain Task Force at the Object Management Group (OMG) recently adopted the Biomolecular Sequence Analysis (BSA) standard. This is good news for an IT director whose job is to facilitate the integration of data across different domains. As standards are adopted, more vendors will produce software products that interoperate. CORBA provides an object-oriented middleware solution that allows objects and services to interact no matter where they're located on the Web, provided they're designed to communicate through a common interface.

CORBA is an ideal technology for constructing large-scale systems. It also lends itself well to incorporating legacy systems through a process known as wrappering. Here, a programmer defines interfaces and writes code to encapsulate an existing process or program so it can be treated as an object in a larger object-oriented system.
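Stripped of ORB details, the wrappering idea is easy to sketch. In this hypothetical Python example (the tool name and its invocation are assumptions, not any real product's interface), a legacy command-line analysis program is hidden behind a class that a larger object-oriented system can call:

    import subprocess

    class AnalysisWrapper:
        """Hypothetical wrapper exposing a legacy command-line
        tool as an object with a single analysis method."""
        def __init__(self, executable="legacy_align"):
            self.executable = executable     # assumed legacy program

        def analyze(self, sequence):
            # Feed the sequence on stdin; return the raw report text.
            result = subprocess.run(
                [self.executable], input=sequence,
                capture_output=True, text=True, check=True)
            return result.stdout

In a CORBA system, the same class body would sit behind an IDL-defined servant.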

However, CORBA is known to have a steep learning curve, and if systems are not properly designed, solutions can also become rigid, fragile, and expensive. In the CORBA paradigm interfaces are defined using Interface Definition Language (IDL). Once the interfaces are specified and the system is designed, the implementation can be accomplished using a variety of languages, including Java, C++, C, Perl, and Python.
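For a flavor of IDL, here is a small hypothetical interface of the kind such a system might define (the names are invented for illustration); once the IDL is compiled, servants implementing it can be written in any of those languages:

    // Hypothetical IDL sketch for a sequence-analysis service
    module Analysis {
        exception InvalidSequence { string reason; };

        interface SequenceAnalyzer {
            // Run an analysis on a raw sequence and return a report.
            string analyze(in string sequence) raises (InvalidSequence);
        };
    };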

Unfortunately, in rapidly evolving industries the requirements for the system may not be well defined at design time. There's also an increasing need to rapidly incorporate functionality into an existing system, even though the specific functionality may have a very short life span or is likely to change or evolve significantly. In this scenario the extra work required to create robust CORBA-based solutions becomes less attractive. The area of drug discovery research reflects just this type of environment. Rapidly changing technologies are driving frequent revisions in the requirements for information systems to access and analyze data.

Developers working for pharmaceutical companies, whether as employees, contractors, or consultants, are interested in solving a researcher's immediate needs while meeting the higher-level objective of producing a robust, enterprise-wide informatics solution. Ideally, an enterprise framework based on COM, CORBA, or some other middleware technology would provide an API to enable these immediate needs to be met.

Pharmaceutical IT directors are discovering that they need a robust system that can handle data in a generic way, yet allow programmers to rapidly incorporate algorithms, tools, and viewers to manipulate the data in ways that weren't considered when the larger system was designed. The key is to provide informaticists with tools that enable them to rapidly prototype functionality and test it within the context of a larger framework. Increasingly, XML appears to be the flexible data interchange format that could provide a means to link complex data into large, enterprise-scale systems. In fact, over the past few years bioinformatics professionals accustomed to using Perl in conjunction with HTML have migrated toward XML in order to take advantage of its ability to encode semantic information.
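The attraction is easy to see in a small invented example: where HTML tags say only how a record should look, XML tags can say what each field means, so a program can locate the organism or the sequence without screen-scraping:

    <!-- Hypothetical record: tags carry meaning, not presentation -->
    <sequence id="AB012345" type="dna">
      <organism>Homo sapiens</organism>
      <description>putative kinase</description>
      <residues>ATGGCGTTA</residues>
    </sequence>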

A Practical Solution
To better meet the needs of drug discovery researchers, NetGenics has combined its XML-based Service Plug-in with its recently released Gene Expression 1.3, a CORBA-based product designed to query, integrate, filter, and view disparate gene expression data to help identify interesting genes worthy of further investigation. In this combination XML allows in-house programmers to write their own scripts to integrate heterogeneous DNA and protein sequence data and execute sophisticated analysis routines (with annotated results) to satisfy research objectives as they arise.

Gene Expression 1.3 works like an object-oriented, biocentric, smart spreadsheet. Experimental values are laid out in a spreadsheet format, with rows typically representing individual genes and columns containing experimental observations about those genes. Users can perform complex calculations and statistical analyses on these data sets by plugging in their own data sources and algorithms.

For instance, a user could plug in a clustering algorithm from a statistical package simply by writing an executable wrapper in a scripting language, such as Perl, and registering it with the system. This new algorithm would then dynamically become available on the client through a drop-down menu. When the user selects the cluster algorithm, the client dynamically generates the appropriate input parameter screens. After the appropriate clustering parameters and data columns are selected, the script is executed and the data is returned to the client spreadsheet in the form of a new column of data containing the clustering details from the algorithm.
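What might such a wrapper look like? The sketch below is a guess at the shape, not NetGenics' actual DTD: the script reads an XML column of values from standard input, applies a trivial threshold rule standing in for a real clustering routine, and writes a new XML column to standard output. All tag and attribute names are hypothetical.

    #!/usr/bin/env python
    # Hypothetical plug-in script: XML column in, XML column out.
    import sys
    import xml.etree.ElementTree as ET

    doc = ET.fromstring(sys.stdin.read())
    threshold = float(doc.get("threshold", "1.0"))   # user parameter

    out = ET.Element("column", name="cluster")
    for cell in doc.iter("cell"):
        # Trivial stand-in for a real clustering algorithm.
        label = "high" if float(cell.text) >= threshold else "low"
        ET.SubElement(out, "cell", row=cell.get("row", "")).text = label

    sys.stdout.write(ET.tostring(out, encoding="unicode"))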

The system allows for additional data to be associated with each cell in the spreadsheet. The user interacts with this additional data by double-clicking on a given cell. This data can be formatted in one of many different formats, including text, HTML, and XML. When formatted in XML, the client is capable of reading the tagged data from the XML and launching the appropriate viewer. For example, a specific script might be capable of returning complex data that's best viewed in a tree format. When the client inspects the XML, it would learn that the data is best viewed using the registered tree viewer. The client would then launch the tree viewer with the XML data.
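A small sketch of that dispatch (viewer names and tags are invented for illustration): the client inspects the payload's root element and launches whichever registered viewer claims that type, falling back to plain text.

    # Hypothetical viewer dispatch keyed on the XML root element.
    import xml.etree.ElementTree as ET

    VIEWERS = {"tree-data": "tree viewer", "sequence": "sequence viewer"}

    def open_cell(payload):
        root = ET.fromstring(payload)
        viewer = VIEWERS.get(root.tag, "text viewer")   # default viewer
        print("launching %s for <%s> data" % (viewer, root.tag))

    open_cell("<tree-data><node label='cluster 1'/></tree-data>")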

Integrating CORBA and XML
The Service Plug-in API is based on the OMG BSA specification and is divided into the following architectural layers (see Figure 1):

  • Script layer: This is an executable script, such as a Perl script, that accepts XML input and returns XML output (like the clustering sketch shown earlier). Typically the input is sent via the script's standard input stream, and the output is returned through the standard output stream. Any executable program, script, or URL can be registered with the system. The most common languages used include Perl, Tcl, Python, C, C++, and Java; however, any language can be used. The only requirement for these scripts is that they be executable via a command line and that their input and output conform to the specific Document Type Definition (DTD) the Service Plug-in expects.
  • Object-to-XML layer: This layer is responsible for rendering objects into XML so they can be passed to the script. It also takes the XML output from the script and converts it into objects.
  • Core Plug-in layer: This layer is responsible for maintaining the metadata about the plug-in, including the analysis type, the types of objects accepted and returned, and information about which script to run. The Core Service Plug-in is a CORBA service that's primarily responsible for handling the transactional semantics associated with running a script. It accepts requests for specific scripts to be run from the adapter layer. These requests are delivered in the form of CORBA service requests from the adapter layer and contain all the data and parameters required to run the script. The data contained in such requests is processed through an object-to-XML conversion process to produce the XML format the script expects. The Core Service Plug-in creates a service instance for each script to be run (it sends XML input to the service and expects XML output). It makes no assumptions about the input or output content or how the content will be used by the service. Specific aspects of the XML encode information that allows the underlying service to be configured, including the execution of scripts and URLs.
  • Adapter layer: This layer is responsible for adapting the BSA API provided by the Core Plug-in layer to the API required by a client component. The adapter layer is also responsible for performing any batch control that's necessary based on metainformation from the Core Plug-in layer. For example, the Plug-in may require analysis subjects to be passed in one at a time, in which case the adapter will have to iterate over the list of subjects to be analyzed. Service Plug-in instances can be reused in different domains by registering adapters with the Plug-in. The inputs and outputs from each adapter may be different, so each Service Plug-in instance allows the user to register metadata that will allow the appropriate object-to-XML and XML-to-object transformations to be performed. Such transformations allow domain-specific inputs and outputs to be translated to something specific to the service.
  • Trader service: A CORBA Trader serves as the repository of all information needed to run a specific service, which includes metadata such as input parameter information, output types, and the location of the script executable. The user interacts with the Trader via a command line interface based on XML. The user encodes the script registration in an XML document. All aspects of the Plug-in instances are registered with the Trader and may be used by one or more adapters to provide behavior.
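The article doesn't reproduce the registration schema, but a registration document plausibly looks something like this (all names are hypothetical):

    <!-- Hypothetical Trader registration for a clustering script -->
    <service name="threshold-cluster">
      <executable>/usr/local/scripts/cluster.py</executable>
      <input type="expression-column"/>
      <output type="cluster-column"/>
      <parameter name="threshold" type="float" default="1.0"/>
    </service>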

In this context XML is simply used to extend the CORBA functionality to incorporate any executable program, script, or URL. Writing scripts that conform to a particular DTD is fairly straightforward, as this requires only the data input and output to be tagged with a few specific tags that indicate their type.
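As a rough idea of how small such a DTD can be, here is a guess at its shape (again, not the product's actual DTD): it needs only a column element, its cells, and their types.

    <!-- Hypothetical minimal I/O DTD for plug-in scripts -->
    <!ELEMENT column (cell*)>
    <!ATTLIST column name CDATA #REQUIRED
                     type (string|integer|float) "string">
    <!ELEMENT cell (#PCDATA)>
    <!ATTLIST cell row CDATA #REQUIRED>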

The Service Plug-in takes this approach only at the bottom layer. XML is used only to pass data between easily written analysis scripts and a CORBA service. CORBA networking is used between the other layers. At the moment we think this is the most appropriate mix.

Summary
CORBA and XML complement one another very well. CORBA provides a stable infrastructure that's object-based, while XML provides a representation of data that's human-readable, capable of encoding semantic information, and verifiable against a DTD. CORBA allows objects to interoperate through standard interfaces regardless of where they're located on the Web. XML on the other hand offers a data interchange format that can be easily transformed between one DTD and another. Thus XML can serve as a communication conduit between CORBA objects when a common interface doesn't exist. No solution is perfect, however, and there are some trade-offs. XML is bulky, and there are better ways of encoding information for transmission if size is important. In addition, when representing CORBA objects with XML, the behavior (operations, methods) is lost. The Service Plug-in leverages the best of both CORBA and XML. CORBA is used to provide the framework or backbone. XML is used to allow lightweight analyses to be plugged in to a CORBA service by leveraging a documented I/O pattern.

The relationship between CORBA and XML is expanding as the OMG actively pursues the use of XML through initiatives such as XML Metadata Interchange (XMI). The OMG's IDL is just one way of representing a model. What if, as in the case of the Service Plug-in described above, an XML representation is needed? Wouldn't it be nice if the model could be designed without worrying about all the possible representations?

Automated tools that address just this issue are becoming available, enabling the design of a model to be done in UML (Unified Modeling Language), with the generation of IDL or XML DTD as additional representations. UML permits semantic specifications that go beyond what's expressible in IDL or XML.

XMI is an OMG standard that allows UML models to be represented in XML. XMI is the result of collaboration among industry leaders such as Unisys, IBM, Oracle, Rational, Fujitsu, and others interested in allowing tools to exchange models. The purpose is to provide standardized methods for sharing data between programming and data modeling tools in a collaborative environment. This will allow developers to share data and designs from different development tools in a heterogeneous distributed environment, resulting in shorter development times for large-scale, multivendor solutions built with such tools. This should come as welcome news to IT professionals in life science informatics and other industries with long development cycles.
