Elon Computing Sciences

Position-Based and Keyword-Based Parsing for Automatic Retrieval of Web Page Information

Presentation at Elon Student Undergraduate Research Forum, Spring 2009

Bradford P. Nock (Dr. Shannon Duvall) Department of Computer Science

Large amounts of easily accessible information have made the Internet an unwieldy asset in the
search for desired information. The diverse structures of web pages make it very difficult to reliably
seek out information in them automatically. The huge number of pages makes it impossible for people to search
through them manually. One solution to this problem, and the focus of this research, is to develop and
evaluate a computer program that is able to adapt to variations in page structure and dependably find
requested information.

This research aims to discover an accurate method of retrieving information from web pages
where the page structure and the location of content are not known and differ between pages. Three methods
were used to retrieve information from the test set of web pages. The first method, called position-based
parsing, worked by searching for familiar HTML tags and structures. For example, a piece of desired
information may be known to appear commonly in lists on web pages. The program could then look for
lists within the HTML and pull out that information.
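
The list example above can be sketched in a few lines of Python. This is only a minimal illustration, not
the program evaluated in this research: the class name ListItemExtractor, the function extract_list_items,
and the sample page are hypothetical, and the sketch assumes that the text of each list item is the
desired information.

```python
from html.parser import HTMLParser


class ListItemExtractor(HTMLParser):
    """Hypothetical helper: collects the text of every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished list-item texts
        self._in_item = False  # True while inside an <li>
        self._buffer = []      # text fragments of the current item

    def _flush(self):
        # Join the fragments and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.items.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self._in_item:  # the previous <li> was never closed
                self._flush()
            self._in_item = True

    def handle_endtag(self, tag):
        # An item ends at </li>, or at the end of its list if </li> is missing.
        if tag in ("li", "ul", "ol") and self._in_item:
            self._flush()
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            self._buffer.append(data)


def extract_list_items(html):
    """Return the text of each list item found in the HTML string."""
    parser = ListItemExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Assumed sample page, invented for illustration.
SAMPLE = ("<h2>Features</h2><ul><li>Fast <b>search</b></li>"
          "<li>Plugin API</li></ul><p>Other text</p>")
position_results = extract_list_items(SAMPLE)
```

A sketch like this returns every list item on the page, including navigation menus, which is one reason a
purely structural search may be unreliable on its own.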

The second method, keyword-based parsing, searched for specific domain-appropriate keywords. One
example is the keyword “features”, where the desired information might be the features of a software
product. The program could then pull out the surrounding web page material wherever it encountered an
instance of the keyword. The third method was a hybrid of the previous two methods and searched for a
familiar structure, then tried to find a keyword within a certain distance of that structure.
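
The keyword-based and hybrid methods can be sketched in the same way. The code below is an assumption-laden
illustration rather than the program used in this research: the names keyword_parse and hybrid_parse, the
window and max_distance parameters, the choice to measure distance in characters of visible text, and the
sample page are all hypothetical.

```python
import re
from html.parser import HTMLParser


class TextAndLists(HTMLParser):
    """Hypothetical helper: records visible text and the span of each list."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # pieces of visible text
        self.length = 0   # total characters recorded so far
        self.lists = []   # (start, end) text offsets of each <ul>/<ol>
        self._open = []   # start offsets of lists not yet closed

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._open.append(self.length)

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._open:
            self.lists.append((self._open.pop(), self.length))

    def handle_data(self, data):
        # A trailing space keeps words in adjacent elements from merging.
        self.chunks.append(data + " ")
        self.length += len(data) + 1


def parse_page(html):
    """Return (visible text, list spans) for an HTML string."""
    parser = TextAndLists()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks), parser.lists


def keyword_hits(text, keyword):
    """Return the start offset of each whole-word, case-insensitive match."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]


def keyword_parse(html, keyword, window=40):
    """Keyword-based: return the text surrounding each keyword occurrence."""
    text, _ = parse_page(html)
    return [text[max(0, h - window): h + len(keyword) + window].strip()
            for h in keyword_hits(text, keyword)]


def hybrid_parse(html, keyword, max_distance=50):
    """Hybrid: return lists that have the keyword within max_distance characters."""
    text, lists = parse_page(html)
    hits = keyword_hits(text, keyword)
    results = []
    for start, end in lists:
        if any(start - max_distance <= h <= end + max_distance for h in hits):
            results.append(" ".join(text[start:end].split()))
    return results


# Assumed sample page: a feature list followed by an unrelated navigation list.
SAMPLE = ("<h2>Features</h2><ul><li>Fast search</li><li>Plugin API</li></ul>"
          "<p>" + "filler " * 20 + "</p><ul><li>Home</li><li>Contact</li></ul>")
keyword_results = keyword_parse(SAMPLE, "features", window=12)
hybrid_results = hybrid_parse(SAMPLE, "features")
```

On the sample page, the hybrid sketch keeps the list that follows the "Features" heading and skips the
navigation list, which lies farther from the keyword than max_distance.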

The accuracy of these three methods will be established by comparing the results of
running the program with the correct information found by hand. Statistical analysis will then be
performed to determine which methods are most effective. It is hypothesized that the hybrid method
will be more successful than the other two methods. We will present the results of these three methods for
web pages that describe open-source projects. The resulting algorithms can be used to automatically
collect information on these open-source projects based on their websites.
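
The comparison against hand-collected answers described above could be scored as in the sketch below. The
abstract does not state which accuracy measure or statistical test is used, so precision and recall are an
assumption here, as are the function name precision_recall and the sample data.

```python
def precision_recall(retrieved, expected):
    """Score program output against hand-collected answers.

    Items are compared case-insensitively after trimming whitespace.
    Precision is the share of retrieved items that are correct;
    recall is the share of expected items that were retrieved.
    """
    retrieved_set = {item.strip().lower() for item in retrieved}
    expected_set = {item.strip().lower() for item in expected}
    correct = retrieved_set & expected_set
    precision = len(correct) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(correct) / len(expected_set) if expected_set else 0.0
    return precision, recall


# Assumed data, invented for illustration only.
program_output = ["Fast search", "Plugin API", "Home"]
found_by_hand = ["fast search", "plugin api", "Cross-platform", "Free"]
precision, recall = precision_recall(program_output, found_by_hand)
```

Computing such scores per page for each of the three methods would give the paired measurements that a
later statistical comparison needs.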