


Decision Support Systems
Volume 23, Issue 1, May 1998, Pages 41-58


PII: S0167-9236(98)00035-9
Copyright © 1998 Elsevier Science B.V. All rights reserved

An intelligent personal spider (agent) for dynamic Internet/Intranet searching

Hsinchun Chen a, *, Yi-Ming Chung a, 1, Marshall Ramsey a, 2 and Christopher C. Yang b, 3

a MIS Department, Karl Eller Graduate School of Management, University of Arizona, McClelland Hall 430Z, Tucson, AZ 85721, USA
b ECE Department, University of Arizona, Tucson, AZ 85721, USA

Available online 18 September 1998.

Abstract

As Internet services based on the World-Wide Web become more popular, information overload has become a pressing research problem. Difficulties with search on the Internet will worsen as the amount of on-line information increases. A scalable approach to Internet search is critical to the success of Internet services and other current and future National Information Infrastructure (NII) applications. As part of the ongoing Illinois Digital Library Initiative project, this research proposes an intelligent personal spider (agent) approach to Internet searching. The approach, which is grounded on automatic textual analysis and general-purpose search algorithms, is expected to be an improvement over the current static and inefficient Internet searches. In this experiment, we implemented Internet personal spiders based on best first search and genetic algorithm techniques. These personal spiders can dynamically take a user's selected starting homepages and search for the most closely related homepages on the Web, based on link and keyword indexing. A plain, static CGI/HTML-based interface was developed earlier and was recently enhanced with a graphical, dynamic Java-based interface. Preliminary evaluation results and two working prototypes (available for Web access) are presented. Although the examples and evaluations presented are mainly based on Internet applications, the applicability of the proposed techniques to the potentially more rewarding Intranet applications should be obvious. In particular, we believe the proposed agent design can be used to locate organization-wide information, to gather new, time-critical organizational information, and to support team-building and communication in Intranets.

Author Keywords: Agents; Machine learning; Spider; Evolutionary programming; Information retrieval; Semantic retrieval; Java; World-Wide Web; Internet; Intranet

Index Terms: Management information systems; Online searching; World Wide Web; Intranets; Artificial intelligence; Information services; Computer architecture; Algorithms; HTML; Java programming language; Interfaces (computer); Computer systems programming; Learning systems; Intelligent personal spiders; Dynamic Internet/Intranet searching; Evolutionary programming

Article Outline

1. Introduction
2. Literature review: Internet spiders
2.1. Client-based search spiders (agents)
2.2. On-line database indexing and searching
3. Research design and algorithms
3.1. Determining score/fitness: the Jaccard's function
3.2. Best first search algorithm
3.2.1. Input anchor homepages and initialize
3.2.2. Determine the best homepage
3.2.3. Explore the best homepage
3.2.4. Iterate until a desired number of homepages is obtained
3.3. Genetic algorithm
3.3.1. Initialize the search space
3.3.2. Cross-over
3.3.3. Mutation
3.3.4. Stochastic selection scheme based on Jaccard's fitness
3.3.5. Convergence
4. Benchmarking experiment
4.1. Complementary searches through exploitation and exploration
4.2. Sparse link constraint
4.3. Exploration and communication are time consuming
5. Dynamic, agent-based interface
5.1. CGI-based user interface
5.2. Java-based user interface
6. Discussion and Conclusion
Acknowledgements
References
Vitae
Vitae
Vitae
Vitae

1. Introduction

Although network protocols and software such as HTTP and Netscape/Mosaic significantly ease the importation and fetching of on-line information sources, users are often unable to explore and find what they want in an enormous information space [1, 2, 20]. While Internet services are popular and appealing to many on-line users, difficulties with search are expected to worsen as the amount of on-line information increases, mainly because of information overload and vocabulary differences [11, 4]. Many researchers consider a scalable approach to Internet search critical to the success of Internet and Intranet services and of other current and future National Information Infrastructure (NII) applications [21, 6].

The main information retrieval mechanisms provided by the prevailing Internet WWW-based software are based on either keyword search (e.g., Lycos, AltaVista, and Yahoo servers) or hypertext browsing (e.g., NCSA Mosaic, Netscape Navigator, and Microsoft Internet Explorer). Keyword search often results in low precision, poor recall, and slow response time due to the limitations of indexing and communication methods (bandwidth), controlled language-based interfaces (the vocabulary problem) and the inability of searchers themselves to fully articulate their needs. Furthermore, browsing allows users to explore only a very small portion of the large Internet information space. An extensive information space accessed through hypertext-like browsing can also potentially confuse and disorient its user, the `embedded digression problem,' and it can cause the user to spend a great deal of time while learning nothing specific, the `art museum phenomenon' [3].

Our proposed approach, which is grounded on automatic textual analysis of Internet documents and general-purpose search algorithms, aims to address the Internet search problem by creating dynamic and `intelligent' personal spiders (agents) that take users' requests and perform real-time, customized searches. In particular, best first search was adopted for a local search personal spider and a genetic algorithm was used to develop a global, stochastic personal spider. These personal spiders (agents) could dynamically take users' selected starting homepages and search for the most closely related homepages on the Web, based on link and keyword indexing. Extensive algorithmic revisions and interface development based on CGI/HTML and Java have been performed. This paper summarizes our current research effort.

We believe that the applicability of the proposed techniques to the potentially more rewarding Intranet applications is promising and specifically includes adoption of the proposed agent design in the following Intranet-related areas.

Locating organization-wide information. By restricting the agent to search on servers within the organizational Intranet boundary (i.e., selected URLs or domain names), the proposed agent can be used to help users find only related Intranet sites instead of wandering through the Internet. For large corporations and government agencies, the need to explore and search effectively within their own loosely-connected Intranet sites is real and pressing. By restricting the search space of the proposed agent, the tool could be used to locate organization-wide Intranet information.

Gathering new, time-critical organizational information. The value of the agent-based spider relies strongly on its ability to perform exhaustive and real-time Internet or Intranet searches. Obsolete and dead Web sites can be avoided: the proposed agent imposes a time-out component to avoid connecting to dead sites (a minimal fetch sketch follows these examples). Such a design is believed to enable the agent to gather new, time-critical organizational information.

Team-building and communications. In our discussions with several Webmasters, it has been suggested that the tool could be used in Intranets for team-building and communication. By launching their own spiders within Intranets, different teams and groups, previously unknown to each other in large corporations, would be able to find Cyberspace collaborators and colleagues, thereby using the spider effectively as an organizational communication tool.
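
The first two scenarios above rest on two simple mechanisms: keeping fetches inside the organizational boundary and timing out connections to unresponsive sites. The following Python fragment is an illustrative sketch only, not the authors' implementation (which fetched pages with Lynx); the domain list and time-out value are hypothetical.

```python
# Illustrative sketch only (not the authors' Lynx-based implementation):
# restrict fetches to an organizational Intranet boundary and skip dead
# sites with a time-out. The domain list and time-out are hypothetical.
from typing import Optional
from urllib.error import URLError
from urllib.parse import urlparse
from urllib.request import urlopen

INTRANET_DOMAINS = {"intranet.example.com"}  # hypothetical boundary

def fetch_if_allowed(url: str, timeout_seconds: float = 10.0) -> Optional[str]:
    """Return the page text, or None if the URL is outside the boundary
    or the server does not respond within the time-out (a dead site)."""
    host = urlparse(url).hostname or ""
    if not any(host == d or host.endswith("." + d) for d in INTRANET_DOMAINS):
        return None  # outside the organizational boundary: do not wander off
    try:
        with urlopen(url, timeout=timeout_seconds) as response:
            return response.read().decode("utf-8", errors="replace")
    except (URLError, OSError):
        return None  # obsolete or dead site: skip it and move on
```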

2. Literature review: Internet spiders

At its inception as the ARPANET, the Internet was conceived primarily as a means of remote log-in and experimentation with telecommunication [2]. However, the predominant usage quickly became e-mail communication. This trend continues into the present form of the Internet, but with increasingly diverse support for collaborative data sharing and distributed, multimedia information access, especially using the World-Wide Web (WWW). Many people consider the Internet and the WWW the backbone of the information superhighway and the window to cyberspace.

The WWW was developed initially to support physicists and engineers at CERN, the European Particle Physics Laboratory in Geneva, Switzerland [1]. In 1993, when several browser programs (most notably the NCSA Mosaic) became available for distributed, multimedia, hypertext-like information fetching, the Internet became a preview of a rich and colorful information cyberspace [22]. However, as Internet services based on the WWW have become more popular, information overload has become a pressing research problem [2]. The user-interaction paradigm on the Internet has shifted from simple hypertext-like browsing (a human-guided activity exploring the organization and contents of an information space) to content-based searching (a process in which the user describes a query and a system locates information that matches the description). Many researchers and practitioners have considered Internet/Intranet searching to be one of the more pressing and rewarding areas of research for future NII applications.

Internet searching has been the hottest topic at recent World-Wide Web Conferences. Two major approaches have been developed and experimented with: (1) the client-based search spider (agent) and (2) the on-line database indexing and searching. However, some systems contain components of both approaches.

2.1. Client-based search spiders (agents)

Broadly defined, an `agent' is a program that can operate autonomously and accomplish unique tasks without direct human supervision (similar to human counterparts, such as real estate agents, travel agents, etc.). The basic idea of agent research is to develop software systems which engage and help all types of end users [19]. Such agents might act as `spiders' on the Internet and look for relevant information [10], analyze meeting output on behalf of executives [5], or filter newsgroup articles based on `induced' (or learned) users' profiles [14]. Many researchers have focused on developing scripting and interfacing languages for designers and users such that they can create mobile agents of their own [24]. Some researchers attempt to address the question: "How should agents interact with each other to form digital teamwork?" Other researchers are more concerned about designing agents which are `intelligent' [19, 5].

Several software programs based on the concept of spiders, agents, or softbots (software robots) have been developed. TueMosaic and the WebCrawler are two prominent early examples. Both of them use variations of conventional best first (local) search strategies [17]. At the First WWW Conference in Geneva, DeBra and Post [9] reported tueMosaic v2.42, modified at the Eindhoven University of Technology (TUE) to use the `fish search' algorithm. Using tueMosaic, users can enter keywords, specify the depth and width of search for links contained in the currently displayed homepages and request the spider agent to fetch homepages connected to the current homepage. The fish search algorithm is a modified best first search method. However, potentially relevant homepages that do not connect with the currently active homepages cannot be retrieved and, when the depth and breadth of search become large (an exponential search), the search space becomes enormous. The inefficiency and local search characteristics of BFS/DFS-based spiders and the communication bandwidth bottleneck on the Internet severely constrained the usefulness of such a local search approach. At the Second WWW Conference, Pinkerton reported a more efficient spider (crawler). The WebCrawler extends tueMosaic's concept by initiating the search from its index and following links in an intelligent order, evaluating the relevance of a link based on the similarity of its anchor text to the user's query. However, problems with local search and the communication bottleneck persist.

Due to the proliferation of WWW sites, many newer spiders with different functionalities have recently been developed. The TkWWW robot was developed by Spetka and funded by the Air Force Rome Laboratory [23]. TkWWW robots are dispatched from the TkWWW browser and are designed to search Web neighborhoods to find logically related homepages and return a list of `hot' links. However, their search process is limited to one or two local links from the original homepages. TkWWW robots can also be run in the background to build HTML indexes, compile WWW statistics, collect a portfolio of pictures, or perform any other functions that can be described by TkWWW Tcl extensions. WebAnts, developed by Leavitt at Carnegie Mellon University, investigates the distribution of information collection tasks to a number of cooperating agents (ants). The goal of WebAnts is to create cooperating agents that share searching results and indexing load without repeating each other's effort. The RBSE (Repository-Based Software Engineering) spider was developed by Eichmann and funded by NASA. The RBSE spider was the first spider to index documents by content. It uses the Mite program to fetch documents and uses four local search mechanisms: (1) breadth first search from a given URL, (2) limited depth first search from a given URL, (3) breadth first search from unvisited URLs in the database and (4) limited depth first search from unvisited URLs in the database. (For a complete review of other similar Internet spiders/agents, readers are referred to Ref. [8].)

2.2. On-line database indexing and searching

An alternative approach to Internet resource discovery is based on the database concept of indexing and keyword searching. Such systems collect complete or partial Web documents and store them on the host server. These documents are then keyword indexed on the host server to provide a searchable interface. Most popular Internet databases such as Lycos, AltaVista, and Yahoo are based on such a design.

Lycos, developed at CMU [15], uses a combination of spider fetching and simple owner-registration. Internet servers can access the Lycos server and complete registration in a few simple steps. In addition, Lycos uses spiders that follow the connections of the registered homepages to identify other un-registered homepages. With this suite of techniques, Lycos has acquired an impressive list of URLs on the Internet. Lycos adopted a heuristics-based indexing approach that indexes each homepage based on its title, headings and subheadings, the 100 most important words, the first 20 lines, size in bytes and number of words. However, the success of Lycos also illustrates the vulnerability of the approach and the daunting task of creating `intelligent' and efficient Internet search engines. Its popularity has caused a severe degradation of information access performance, due to the communication bottleneck and the task of finding selected documents in an all-in-one database of Internet homepages.

AltaVista, developed at Digital's Research Laboratories in Palo Alto, combines a fast Web crawler with scalable indexing software to build a large index of the Web. It was made public on 15 December 1995 and has quickly become one of the most comprehensive searchable databases on the Internet. It also provides a full-text index that is updated in real-time for over 13 000 news groups. Although based on similar local search spider algorithms, the AltaVista server has been successful due to its superior hardware platforms and high-end communication bandwidth.

Instead of taking the all-in-one database approach adopted by Lycos and AltaVista, the Yahoo server represents an attempt to partition the Internet information space into meaningful subject categories (e.g., science, entertainment, engineering, etc.). However, its manually-created subject categories are limited in their granularity and the process of creating such categories is cumbersome and time-consuming. The demand to create up-to-date and fine-grained subject categories and the requirement that an owner place a homepage under a proper subject category have significantly hampered Yahoo's success and popularity.

3. Research design and algorithms

Funded by the ongoing Illinois Digital Library Initiative project [21, 7], this research aims to create `intelligent' Internet personal search agents that can be deployed on the Web for efficient, timely, and optimal searches. We planned to answer the following two general research questions: (1) Can an Internet spider be designed to take individual users' requests and perform a global, optimal search on the Internet (i.e., not restricted by the links connected to the starting/anchor homepages)? and (2) Can a dynamic, agent-based interface be designed to allow users to present requests, evaluate intermediate results and perform analysis during personal spider search sessions?

After a careful evaluation of many general-purpose search algorithms, a genetic algorithm (GA), which featured a global, stochastic search process, was examined in detail. The complex and dynamic nature of the Internet/WWW appears to be suited for application of such an algorithm. A comparison of a GA-based personal spider and a conventional best first search (BFS) spider was performed in our experiment.

An earlier version of a CGI/HTML interface for GA/BFS spiders revealed the inadequacy of the stateless, static HTML interface. A prototype interface based on Java was, therefore, designed to allow dynamic interactions between users and personal agents.

In this section, we describe the search algorithms implemented. Our CGI/HTML and Java interface will be illustrated in the next section.

3.1. Determining score/fitness: the Jaccard's function

In order to determine the `goodness' (or fitness, in GA terminology) of a given new homepage, a Jaccard's similarity function was adopted [18]. Each homepage was represented by its connecting links and a weighted vector of keywords that had been automatically indexed by our system. A new homepage fetched by the system was compared with the anchor/starting homepages to determine whether or not it was promising. A new homepage that was more similar to the starting homepages was considered more promising and thus was explored first. The Jaccard's function adopted combined, with equal weights, the Jaccard's score from links and the Jaccard's score from keywords.

Jaccard's scores from links. Given two homepages, A and B, and their connected links/URLs, X=(x1,x2,...,xm) and Y=(y1,y2,...,yn), the Jaccard's score between A and B based on links was computed as follows:

$$JS_{\mathrm{links}}(A,B) = \frac{\#(X \cap Y)}{\#(X \cup Y)} \tag{1}$$

where #(S) indicates the cardinality of set S.

Jaccard's scores from keywords. For a given homepage, terms were identified based on an automatic indexing procedure developed in our previous research [7]. Term frequency (tf) and inverse document frequency (idf), term weighting heuristics also adopted in such popular searchable databases as Lycos, were then computed. Term frequency, tfij, represents the number of occurrences of term j in document (homepage) i. Homepage frequency, dfj, represents the number of homepages in a collection of N homepages in which term j occurs. The combined weight of term j in homepage i, dij, was computed as follows:

$$d_{ij} = tf_{ij} \times \log\!\left(\frac{N}{df_{j}} \times w_{j}\right) \tag{2}$$

where wj represents the number of words in term j and N represents the total number of homepages connected to the starting homepages.

Representing each homepage as a weighted vector of keywords, the Jaccard's score between homepages A and B based on keywords was computed as follows:

$$JS_{\mathrm{keywords}}(A,B) = \frac{\sum_{j=1}^{L} d_{Aj}\, d_{Bj}}{\sum_{j=1}^{L} d_{Aj}^{2} + \sum_{j=1}^{L} d_{Bj}^{2} - \sum_{j=1}^{L} d_{Aj}\, d_{Bj}} \tag{3}$$

where L is the total number of terms.

The combined Jaccard's score between any two homepages, A and B, was a weighted summation of the above two Jaccard's scores, i.e.,

$$JS(A,B) = 0.5 \times JS_{\mathrm{links}}(A,B) + 0.5 \times JS_{\mathrm{keywords}}(A,B) \tag{4}$$
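
To make the scoring concrete, the following is a brief, illustrative Python sketch of Eqs. (1)-(4). The function and variable names are ours, not the authors', and the keyword vectors are assumed to have already been built by the automatic indexing procedure of [7].

```python
# Illustrative sketch of the Jaccard's scores in Eqs. (1)-(4); names are ours.
import math
from typing import Dict, Set

def term_weight(tf: int, df: int, n_homepages: int, n_words_in_term: int) -> float:
    """Combined weight d_ij of Eq. (2): tf * log(N/df * w_j)."""
    return tf * math.log((n_homepages / df) * n_words_in_term)

def jaccard_links(links_a: Set[str], links_b: Set[str]) -> float:
    """Eq. (1): size of the link intersection over the size of the link union."""
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0

def jaccard_keywords(vec_a: Dict[str, float], vec_b: Dict[str, float]) -> float:
    """Eq. (3): weighted Jaccard score over the tf-idf keyword vectors."""
    dot = sum(vec_a[t] * vec_b.get(t, 0.0) for t in vec_a)
    norm_a = sum(w * w for w in vec_a.values())
    norm_b = sum(w * w for w in vec_b.values())
    denom = norm_a + norm_b - dot
    return dot / denom if denom else 0.0

def jaccard_combined(links_a: Set[str], links_b: Set[str],
                     vec_a: Dict[str, float], vec_b: Dict[str, float]) -> float:
    """Eq. (4): equal-weight combination of the link and keyword scores."""
    return 0.5 * jaccard_links(links_a, links_b) + 0.5 * jaccard_keywords(vec_a, vec_b)
```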

3.2. Best first search algorithm

Two search algorithms, best first search and a genetic algorithm, were investigated in detail. The best first search algorithm was developed to simulate the various client-based spiders developed in earlier studies and was used as a benchmark for comparison. The genetic algorithm was adopted to enhance the global, optimal search capability of existing Internet spiders.

Best first search is a serial state space traversal method [17]. In our implementation, the algorithm explored the best (based on Jaccard's score of new homepage vs. anchor homepages) homepage at each iteration and terminated when the system had identified the desired number of homepages requested by a user. A sketch of the best first search algorithm adopted in our personal agent is presented below.

3.2.1. Input anchor homepages and initialize

Initialize an iteration counter k to 0. Obtain a desired number of homepages from users and a set of input anchor homepages, (input1, input2,..., inputm). These input homepages represent the users' preferred starting points for Internet search and their interests. Texts of homepages are fetched over the network in real time via Lynx HTTP communication software and homepages (URLs) connected from these input homepages are extracted and saved in the unexplored homepage queue, H, where H=(h1, h2,..., hn).

3.2.2. Determine the best homepage

Based on the Jaccard's function described earlier, determine the best homepage, p, in H, which has the highest Jaccard's score among all the homepages in H and save it as outputp. This homepage is considered most similar to the anchor homepages in both keywords and links, thus, it should be explored first.

3.2.3. Explore the best homepage

Fetch the best homepage using Lynx and add its connected homepages to the unexplored homepage queue, H. Increment iteration counter k by one.

3.2.4. Iterate until a desired number of homepages is obtained

Repeat the above steps until k is equal to the total number of homepages requested by the user.
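
Putting steps 3.2.1-3.2.4 together, the best first search spider reduces to a loop over a priority queue ordered by Jaccard's score. The sketch below is illustrative only: fetch_links and score are caller-supplied stand-ins for the Lynx-based fetching and the combined Jaccard's scoring of the real system.

```python
# Illustrative sketch of the best first search spider (Sections 3.2.1-3.2.4).
# fetch_links(url) and score(url) stand in for the Lynx-based fetching and the
# combined Jaccard's score against the anchor homepages.
import heapq
from typing import Callable, Iterable, List, Set, Tuple

def best_first_spider(anchors: Iterable[str],
                      fetch_links: Callable[[str], Iterable[str]],
                      score: Callable[[str], float],
                      n_wanted: int) -> List[str]:
    outputs: List[str] = []
    seen: Set[str] = set(anchors)
    frontier: List[Tuple[float, str]] = []   # max-heap via negated scores

    # 3.2.1 Initialize: queue every homepage linked from the anchor homepages.
    for anchor in anchors:
        for url in fetch_links(anchor):
            if url not in seen:
                seen.add(url)
                heapq.heappush(frontier, (-score(url), url))

    # 3.2.2-3.2.4 Repeatedly take the best unexplored homepage, report it,
    # and enqueue its own links, until the user's quota is met.
    while frontier and len(outputs) < n_wanted:
        _, best = heapq.heappop(frontier)
        outputs.append(best)
        for url in fetch_links(best):
            if url not in seen:
                seen.add(url)
                heapq.heappush(frontier, (-score(url), url))
    return outputs
```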

3.3. Genetic algorithm

Genetic algorithms (GAs) [12, 16, 13] are problem-solving systems based on principles of evolution and heredity. A GA performs a stochastic evolution process toward global optimization through the use of cross-over and mutation operators. The search space of the problem is represented as a collection of individuals, referred to as chromosomes. The quality of a chromosome is measured by a fitness function (the Jaccard's score in our implementation). After initialization, each generation produces new children based on the genetic cross-over and mutation operators. The process terminates when two consecutive generations do not produce a noticeable improvement in population fitness (i.e., the improvement falls below a small threshold value and the population converges). A sketch of the genetic algorithm adopted for Internet client-based searching is presented below.

3.3.1. Initialize the search space

The GA spider attempts to find the most relevant other homepages in the entire Internet search space, starting from the user-supplied homepages. Initially, the system saves all the input homepages in a set called the Current Generation, CG=(cg1, cg2,..., cgm).

3.3.2. Cross-over

A heuristics-based cross-over operation is then used. New homepages connected to the starting homepages in the CG set are extracted. Homepages that are linked from multiple starting homepages (i.e., multiple parents) are considered Cross-over Homepages and saved in a new set, C={c1,c2,...}.
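
As an illustration only (not the authors' code), the cross-over heuristic can be expressed as counting how many current-generation parents link to each candidate URL:

```python
# Illustrative sketch: Cross-over Homepages are those linked from more than
# one parent in the Current Generation (Section 3.3.2).
from typing import Dict, Set

def crossover_homepages(parent_links: Dict[str, Set[str]]) -> Set[str]:
    """parent_links maps each current-generation homepage to its outgoing URLs."""
    counts: Dict[str, int] = {}
    for links in parent_links.values():
        for url in links:
            counts[url] = counts.get(url, 0) + 1
    return {url for url, n in counts.items() if n >= 2}
```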

3.3.3. Mutation

In order to avoid becoming trapped in a local minimum, which might result from adopting a simple cross-over operator alone, we have added a heuristics-based mutation procedure to add diversity to the homepage population. A Yahoo spider created in our previous research is used to traverse Yahoo's 14 high-level subject categories (e.g., science, business, entertainment, etc.) and collect several thousand `mutation seed' homepages in each category. These homepages are indexed using the Web indexing freeware SWISH (Simple Web Indexing System for Humans). When the GA search algorithm requests a mutated homepage, the system retrieves the top-ranked homepage from the homepages in the user-specified category, based on the keywords present in the anchor homepages. This process is similar to performing a search on the Yahoo database in order to suggest new, promising homepages for further exploration. New mutated homepages are saved in the set of Mutation Homepages, M={m1, m2,...}.
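
The mutation step can be pictured as a keyword lookup against the pre-built seed index. The sketch below is a simplification: a plain keyword-overlap count stands in for the SWISH search, and the seed index is assumed to be supplied by the caller.

```python
# Illustrative sketch of mutation (Section 3.3.3): pick the seed homepage whose
# indexed keywords best match the anchor homepages' keywords. A simple overlap
# count stands in for the SWISH keyword search used in the paper.
from typing import Dict, Optional, Set

def mutate(anchor_keywords: Set[str],
           seed_index: Dict[str, Set[str]]) -> Optional[str]:
    """seed_index maps each 'mutation seed' URL (from the chosen Yahoo
    category) to its indexed keywords; returns the top-ranked seed URL."""
    best_url, best_overlap = None, 0
    for url, keywords in seed_index.items():
        overlap = len(anchor_keywords & keywords)
        if overlap > best_overlap:
            best_url, best_overlap = url, overlap
    return best_url
```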

The probabilities of mutation and cross-over can vary depending on user needs. Higher cross-over probabilities generally support exploitation of local linkages, while higher mutation probabilities support exploration of the global landscape. Exploitation and exploration are two powerful features of genetic programming [16]. Our default settings for both cross-over and mutation probabilities are 50%.

3.3.4. Stochastic selection scheme based on Jaccard's fitness

Each new cross-over and mutation homepage is evaluated with the same Jaccard's function. Based on an `elite selection' procedure [16], homepages which obtain higher fitness values are selected stochastically: a random number generator, biased by each homepage's fitness value, is used to select `fitter' homepages for the new generation. Homepages that `survive' this (natural) selection procedure become the population for the new generation.
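
Fitness-proportional (roulette-wheel) selection of this kind can be sketched as follows; whether the authors' procedure also carried the single best homepage forward deterministically is not spelled out here, so only the stochastic part is shown.

```python
# Illustrative sketch of the stochastic selection step (Section 3.3.4):
# homepages are drawn for the next generation with probability proportional
# to their Jaccard's fitness (roulette-wheel selection).
import random
from typing import Dict, List

def select_next_generation(fitness: Dict[str, float], size: int) -> List[str]:
    urls = list(fitness)
    weights = [fitness[url] for url in urls]
    if not urls or sum(weights) <= 0:
        return urls[:size]  # degenerate case: nothing to weight by
    return random.choices(urls, weights=weights, k=size)
```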

3.3.5. Convergence

Repeat the above steps until the improvement in total fitness between two generations is less than a small threshold value (empirically determined). The final converged set of homepages is then presented to users as the output homepages.
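
Taken together, the generation loop looks roughly like the sketch below, which mirrors the helpers sketched in Sections 3.3.2-3.3.4. The population size, convergence threshold and generation cap are illustrative defaults, not the authors' settings; only the 50%/50% cross-over and mutation probabilities come from the paper.

```python
# Illustrative sketch of the overall GA spider loop (Sections 3.3.1-3.3.5).
# score(url) stands in for the combined Jaccard's fitness against the anchors,
# expand_links(url) for Lynx-based link extraction, and mutate_candidate() for
# the SWISH-backed seed lookup sketched in Section 3.3.3.
import random
from typing import Callable, Dict, List, Set

def ga_spider(anchors: List[str],
              score: Callable[[str], float],
              expand_links: Callable[[str], Set[str]],
              mutate_candidate: Callable[[], str],
              population_size: int = 10,
              p_crossover: float = 0.5,
              p_mutation: float = 0.5,
              threshold: float = 1e-3,
              max_generations: int = 50) -> List[str]:
    generation: List[str] = list(anchors)                  # 3.3.1 initialize CG
    previous_total = sum(score(url) for url in generation)

    for _ in range(max_generations):
        candidates: Set[str] = set(generation)

        if random.random() < p_crossover:                  # 3.3.2 cross-over:
            counts: Dict[str, int] = {}                    # pages linked from
            for parent in generation:                      # multiple parents
                for url in expand_links(parent):
                    counts[url] = counts.get(url, 0) + 1
            candidates |= {url for url, n in counts.items() if n >= 2}

        if random.random() < p_mutation:                   # 3.3.3 mutation
            candidates.add(mutate_candidate())

        pool = sorted(candidates)                          # 3.3.4 stochastic,
        weights = [max(score(url), 1e-9) for url in pool]  # fitness-biased
        generation = random.choices(pool, weights=weights, # selection
                                    k=min(population_size, len(pool)))

        total = sum(score(url) for url in generation)
        if abs(total - previous_total) < threshold:        # 3.3.5 convergence
            break
        previous_total = total

    return sorted(set(generation), key=score, reverse=True)
```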

4. Benchmarking experiment

In an attempt to examine the quality of results obtained by best first search and genetic algorithm, we performed a set of benchmarking experiments, which compared the performances and efficiency of the best first search and genetic algorithm-based personal spiders. Using a test set of 40 search scenarios, each composed of one to three homepages in different subject areas, we examined the final Jaccard's scores of the BFS/GA-suggested (top 10) homepages and their corresponding CPU times and wall clock times. Higher Jaccard's scores of new homepages would suggest a closer match to a user's stated query interests (i.e., the anchor/starting homepages).

Detailed benchmarking results are presented in Table 1. Fig. 1, Fig. 2 and Fig. 3 show statistical analyses of the final fitness scores, CPU times and wall clock times for the 40 test cases.

Table 1. Detailed statistics of the benchmarking results for 40 test cases



Fig. 1. Statistics of the average Jaccard's scores obtained from 40 test cases by best first search and genetic algorithm.



Fig. 2. Statistics of the CPU time for 40 test cases by best first search and genetic algorithm.



Fig. 3. Statistics of the wall clock time for 40 test cases by best first search and genetic algorithm.

4.1. Complementary searches through exploitation and exploration

The results show that the output homepages obtained by genetic algorithm had a slightly higher fitness score than those obtained by best first search, but the difference is not significant. The averages of 40 Jaccard's scores for the genetic algorithm and the best first search were 0.08705 and 0.08519, respectively. Although the Jaccard's scores showed no significant difference between the performances of genetic algorithm and best first search, we noticed that about 50% of the homepages obtained from the genetic algorithm were the result of the mutation operation (cross-over and mutation probabilities were set to 50% and 50%, respectively). We found that these homepages, although promising, had never been linked to the starting homepages, thus, they could not have been obtained by any local search spiders (including our best first search spider). This suggests the potential usefulness of the genetic algorithm spider to supplement the local best first search spider, i.e., by permitting combination of the results in both sets.

During our experimentation, we also found that the genetic algorithm performed very similarly to best first search when the mutation probabilities were set low (say 5%) and cross-over probabilities were high (say 95%). With limited mutation operations, the cross-over operation in genetic algorithm accomplished a local exploitation process similar to that of a local best first search. The mutation process appears to have been instrumental in allowing our personal spider to get out of the local search minimum.

4.2. Sparse link constraint

In addition, we also noticed that starting homepages played an important role in determining system output. If starting homepages contained very few and sparsely connected links, best first search spiders tended to get trapped quickly in the Internet search space because of lack of traversal paths. The genetic algorithm, however, was not restricted by such a sparse link constraint because of its mutation operator. On the other hand, for starting homepages that contained rich and dense connections, best first search often resulted in a fruitful final search set. The genetic algorithm only added limited diversity in such a scenario.

4.3. Exploration and communication are time consuming

The genetic algorithm-based spider was significantly more time-consuming than the best first search spider, as shown in Fig. 2 and Fig. 3. The average CPU times were 1 min and 51 s for the best first search and 3 min and 11 s for the genetic algorithm. Due to the communication bottleneck, the average wall clock times were far larger: 23 min and 45 s for the best first search and 45 min and 41 s for the genetic algorithm. The SWISH keyword search procedure implemented in the genetic algorithm spider caused a significant CPU time requirement. The elite selection procedure and multiple generations also consumed significant CPU cycles.

However, the communication bandwidth (i.e., the actual time to fetch a remote homepage) seemed to be the most significant bottleneck of the entire process for both the best first search spider and the genetic algorithm spider. This deficiency can only be resolved when the current Internet backbones are upgraded.

5. Dynamic, agent-based interface

Currently, we have developed two interfaces for our spiders: one based on CGI/HTML and the other based on Java. The CGI/HTML implementation uses image maps and fill-out forms that interact with the HTTP server. However, it is static and does not support dynamic display and interaction during the search process.

On the other hand, Java is an object-oriented, platform-independent, multi-threaded, dynamic, graphical, general-purpose programming environment for the Internet, Intranet and any other complex, distributed network. The Java interface allows us to display lively intermediate spider search results and accept changes of input parameters (e.g., cross-over and mutation probabilities) dynamically. These dynamic, interactive features of Java are crucial to the design of customizable and `intelligent' agents [5].

The two prototype interfaces are summarized below. Readers are encouraged to connect to the University of Arizona Artificial Intelligence Group homepage (http://ai.bpa.arizona.edu/) for actual demonstrations.

5.1. CGI-based user interface

The CGI-based user interface provides fill-in forms that let users submit input to the spiders. Users may request local or global search, as shown in Fig. 4. When the local search spider is invoked, as shown in Fig. 5, users are asked to provide up to five starting URLs. Users also need to indicate the desired number of searched homepages and preferred types of servers.



Fig. 4. The CGI/HTML spider homepage.



Fig. 5. The input homepage of the local best first search spider.

Similarly, invoking the global search spider results in a fill-in form as shown in Fig. 6. The evolution-based option activates the genetic algorithm spider. The probability-based option activates a simulated annealing-based spider developed earlier in our research. In addition to providing starting URLs and the desired number of homepages, users also need to indicate the preferred `mutation seed' database from which to draw new mutation homepages.



Fig. 6. The input homepage of the global genetic algorithm search spider.

After the search request is submitted, the system produces an output homepage (Fig. 7) which displays the system-suggested relevant homepages, each with a title and keyword summary. Each homepage can then be clicked on for closer examination. Due to the real-time nature of our spiders, all homepages retrieved are `alive,' unlike the many `dead' homepages often found in other all-in-one searchable homepage databases.



Fig. 7. The output homepage of the global genetic algorithm search spider.

5.2. Java-based user interface

Despite these `intelligent' and customized search processes, the CGI/HTML interface is severely hampered by its lack of dynamic display and interaction. A Java-based user interface was designed to alleviate these problems.

The Java interface homepage is shown in Fig. 8 and Fig. 9. When `global evolution-based search' is clicked, the system displays a dialog box similar to that designed for the CGI/HTML genetic algorithm spider interface. In addition, users can set their preferred cross-over and mutation probabilities. A time-out mechanism was also introduced to keep the system from time-consuming, unfruitful connections.



Fig. 8. The Java spider homepage.



Fig. 9. The control panel for initiating a Java-based genetic algorithm spider.

Fig. 10 shows the window which displays the result of the entire search process dynamically and graphically. The control panel is displayed at the top of the window. All input parameters can be changed during an ongoing search process, producing different search results. The fetched URLs are displayed during each generation (instead of the final results at the last generation) and can be clicked on for real-time evaluation. The system also graphically displays the Jaccard's link score, keyword score and fetch time score for each homepage (in three different colors, not shown in the attached screen dump). A spider-chasing-fly animation is displayed dynamically when our `spider' is out chasing a new homepage (fly).



Fig. 10. The display window shows the result of the search process dynamically. Animation is displayed on the upper right-hand corner. Control panel which allows user to change parameters during the process is located at the upper portion. Search results are summarized at the center of the window.

Since we placed our Java-based spider on our server in Summer 1996, the response from our initial test subjects has been overwhelming. Users found the Java-based interface to be more interactive, lively and friendly than our earlier CGI/HTML interface. They have described our spider as a dynamic, intelligent personal agent, rather than a static, non-customizable Internet database search engine.

6. Discussion and Conclusion

The results from our current experimentation with Internet personal spiders are encouraging. In response to Research Question 1 (designing a global, optimal search spider), although the genetic algorithm spider did not out-perform the best first search spider, we found the two sets of results to be comparable and complementary. The mutation process introduced in the genetic algorithm allows users to find other potentially relevant homepages that cannot be reached via a conventional local search process. Regarding Research Question 2, we found the Java-based interface to be a necessary component for designing an interactive and dynamic Internet agent. The CGI/HTML interface is simply too restrictive for such a task.

Although the examples and evaluations presented are mainly based on Internet applications, the applicability of the proposed techniques to the potentially more rewarding Intranet applications should be obvious. In particular, we believe that the proposed agent design can be used to locate organization-wide information, to gather new, time-critical organizational information and to support team-building and communication in Intranets.

In our ongoing effort in the Illinois Digital Library Initiative project, we are exploring other general-purpose search and classification algorithms for Internet resource categorization and search. Several neural network-based algorithms have been explored, including the Hopfield network and the Kohonen self-organizing map, both of which are under development in Java.

Acknowledgements

We would like to thank University of Arizona Artificial Intelligence Group members for their participation in our experiment. We also thank Prof. Jerome Yen of Hong Kong University and Prof. Pai-chun Ma of Hong Kong University of Science and Technology for their comments and involvement during system development and testing. This project was supported mainly by the following grants: NSF/ARPA/NASA Digital Library Initiative, IRI-9411318, 1994–1998 (B. Schatz, H. Chen et al., Building the Interspace: Digital Library Infrastructure for a University Engineering Community); NSF CISE, IRI-9525790, 1995–1998 (H. Chen, Concept-based Categorization and Search on Internet: a Machine Learning, Parallel Computing Approach); AT&T Foundation Special Purpose Grants in Science and Engineering, 1994–1995 (H. Chen); and National Center for Supercomputing Applications (NCSA), High-performance Computing Resources Grants, 1994–1996 (H. Chen).

Vitae

Hsinchun Chen is a Professor of Management Information Systems at the University of Arizona and head of the UA/MIS Artificial Intelligence Group. He is also a Visiting Senior Research Scientist at the National Center for Supercomputing Applications (NCSA). He received an NSF Research Initiation Award in 1992, the Hawaii International Conference on System Sciences (HICSS) Best Paper Award and an AT&T Foundation Award in Science and Engineering in 1994 and 1995. He received the PhD degree in Information Systems from New York University in 1989. Chen has published more than 30 articles covering semantic retrieval, search algorithms, knowledge discovery and collaborative computing. He is a PI of the Illinois Digital Library Initiative project, funded by NSF/ARPA/NASA, 1994–1998, and has received several grants from NSF, DARPA, NASA, NIH and NCSA. He is the guest editor of the IEEE Computer special issue on `Building Large-Scale Digital Libraries' and the Journal of the American Society for Information Science special issue on `Artificial Intelligence Techniques for Emerging Information Systems Applications.' His recent work was featured in Science (Computation Cracks `Semantic Barriers' Between Databases, June 7, 1996), NCSA Access Magazine, HPCWire and Business Week.

Vitae

Yi-Ming Chung received her MS degree in Management Information Systems from the University of Arizona in 1996. Her thesis explored the use of genetic algorithms and neural networks for Internet search and vocabulary switching. She continues her research at the CANIS-Community Systems Laboratory, University of Illinois at Urbana-Champaign as a research programmer. Currently, she is working on automatic subject indexing, concept mapping and vocabulary switching across community repositories in the Internet. Her research interests include digital libraries, neural networks, intelligent agents, object-oriented design and design patterns.

Vitae

Marshall Ramsey is a PhD student at the University of Arizona's Department of Management Information Systems and a member of the UA/MIS Artificial Intelligence Group. He received his BS degree (MIS) in 1993 and MS degree (MIS) in 1997 from the University of Arizona. He was awarded a Research Fellowship from the National Library of Medicine (1996 to 1997) for work in semantic retrieval for large collections of medical documents. His research interests are cross-media and translingual semantic retrieval, data visualization and digital libraries.

Vitae

Christopher C. Yang is an assistant professor in the Department of Computer Science and Information Systems at the University of Hong Kong. He was born in Hong Kong. He received his BS, MS and PhD in Electrical Engineering from the University of Arizona, Tucson, AZ, in 1990, 1992 and 1997, respectively. From 1995 to 1997, he was a research scientist in the UA/MIS Artificial Intelligence Group in the Department of Management Information Systems at the University of Arizona. From 1992 to 1997, he was a research associate in the Intelligent Systems Laboratory in the Department of Electrical and Computer Engineering. His current research interests are digital libraries, Internet agents, visualization, color image processing, constraint networks and computer-integrated manufacturing and inspection. He was a member of the program committee for the 1997 IEEE International Conference on Systems, Man and Cybernetics, a member of the organizing committee for the 1998 3rd Asian Conference on Computer Vision and a member of the program committee for the 1998 1st Asian Digital Library Workshop.

References

1. T. Berners-Lee, R. Cailliau, A. Luotonen, H.F. Nielsen and A. Secret, The World-Wide Web. Commun. ACM 37 8 (1994), pp. 76–82.

2. C.M. Bowman, P.B. Danzig, U. Manber and F. Schwartz, Scalable Internet resource discovery: research problems and approaches. Commun. ACM 37 8 (1994), pp. 98–107.

3. E. Carmel, S. Crawford and H. Chen, Browsing in hypertext: a cognitive study. IEEE Trans. Syst., Man Cybernetics 22 5 (1992), pp. 865–884.

4. H. Chen, Collaborative systems: solving the vocabulary problem. IEEE Computer 27 5 (1994), pp. 58–66, Special Issue on Computer-Supported Cooperative Work (CSCW).

5. H. Chen, A. Houston, J. Yen and J.F. Nunamaker, Toward intelligent meeting agents. IEEE Computer 29 8 (1996), pp. 62–70.

6. H. Chen and B.R. Schatz, Semantic retrieval for the NCSA Mosaic, Proceedings of the Second International World-Wide Web Conference 1994, Chicago, IL, October 17–20, 1994.

7. H. Chen, B.R. Schatz, T.D. Ng, J.P. Martinez, A.J. Kirchhoff and C. Lin, A parallel computing approach to creating engineering concept spaces for semantic retrieval: the Illinois Digital Library Initiative Project. IEEE Trans. Pattern Anal. Machine Intelligence 18 8 (1996), pp. 771–782.

8. F. Cheong, Internet Agents, New Riders Publishing, Indianapolis, IN, 1996.

9. P. DeBra and R. Post, Information retrieval in the World-Wide Web: making client-based searching feasible, Proceedings of the First International World-Wide Web Conference 1994, Geneva, Switzerland, 1994.

10. O. Etzioni and D. Weld, A softbot-based interface to the Internet. Commun. ACM 37 7 (1994), pp. 72–79.

11. G.W. Furnas, T.K. Landauer, L.M. Gomez and S.T. Dumais, The vocabulary problem in human–system communication. Commun. ACM 30 11 (1987), pp. 964–971.

12. D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.

13. J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, MA, 1992.

14. P. Maes, Agents that reduce work and information overload. Commun. ACM 37 7 (1994), pp. 30–40.

15. Mauldin and Leavitt, Web-agent related research at the CMT, Proceedings of the ACM Special Interest Group on Networked Information Discovery and Retrieval (SIGNIDR-94), August 1994.

16. Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, Berlin, 1992.

17. J. Pearl, Heuristics: Intelligent Search Strategies for Computer Problem Solving, Addison-Wesley, Reading, MA, 1984.

18. E. Rasmussen, Clustering algorithms. In: W.B. Frakes and R. Baeza-Yates, Editors, Information Retrieval: Data Structures and Algorithms, Prentice Hall, Englewood Cliffs, NJ, 1992.

19. D. Rieken, Intelligent agents. Commun. ACM 37 7 (1994), pp. 18–21.

20. B.R. Schatz, A. Bishop, W. Mischo and J. Hardin, Digital library infrastructure for a university engineering community, Proceedings of Digital Libraries '94, June 1994, pp. 21–24.

21. B.R. Schatz and H. Chen, Building large-scale digital libraries. IEEE Computer 29 5 (1996), pp. 22–27.

22. B.R. Schatz and J.B. Hardin, NCSA Mosaic and the World Wide Web: global hypermedia protocols for the Internet. Science 265 (1994), pp. 895–901.

23. S. Spetka, The TkWWW robot: beyond browsing, Proceedings of the Second International World-Wide Web Conference 1994, October 17–20, 1994.

24. M.M. Waldrop, Software agents prepare to sift the riches of cyberspace. Science 265 (1994), pp. 882–883.

*Corresponding author. Tel.: +1-520-621-4153; e-mail: hchen@bpa.arizona.edu

1Currently a research programmer at the CANIS-Community Systems Laboratory, University of Illinois at Urbana-Champaign. E-mail: ychung@bpa.arizona.edu

2E-mail: mramsey@bpa.arizona.edu

3Currently an assistant professor at the University of Hong Kong. E-mail: chrisy@ece.arizona.edu

