2009 SURF – Brad Nock | Computer Science

Position-Based and Keyword-Based Parsing for Automatic Retrieval of Web Page Information

Presentation at Elon Student Undergraduate Research Forum, Spring 2009

Bradford P. Nock (Dr. Shannon Duvall) Department of Computer Science

The sheer volume of easily accessible information has made the Internet unwieldy to search for any specific piece of information. The diverse structures of web pages make it difficult to locate information in them reliably and automatically, and the huge number of pages makes it impractical for people to search through them manually. One solution to this problem, and the focus of this research, is to develop and evaluate a computer program that adapts to variations in page structure and dependably finds the requested information.

This research aims to discover an accurate method of retrieving information from web pages whose structure and content location are not known in advance and differ from page to page. Three methods were used to retrieve information from the test set of web pages. The first method, called position-based parsing, worked by searching for familiar HTML tags and structures. For example, a piece of desired information may be known to appear commonly in lists on web pages. The program could then look for lists within the HTML and pull out their contents.
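As an illustration only, position-based parsing of lists might be sketched as follows. This is not the program used in the study: the class name ListItemExtractor and the function extract_list_items are hypothetical, and the sketch assumes that the "familiar structure" is an HTML list (ul or ol) whose items are the desired information. It relies only on Python's standard html.parser module.

```python
from html.parser import HTMLParser


class ListItemExtractor(HTMLParser):
    """Collects the text of every <li> found inside a <ul> or <ol> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.list_depth = 0   # number of <ul>/<ol> elements currently open
        self.in_item = False  # True while inside an <li>
        self.buffer = []      # text pieces of the item being read
        self.items = []       # finished list items, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.list_depth += 1
        elif tag == "li" and self.list_depth > 0:
            self._flush()     # also closes a previous <li> that was never ended
            self.in_item = True

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self._flush()
            self.list_depth = max(0, self.list_depth - 1)
        elif tag == "li":
            self._flush()

    def handle_data(self, data):
        if self.in_item:
            self.buffer.append(data)

    def _flush(self):
        # Join the collected pieces and normalize whitespace.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.items.append(text)
        self.buffer = []
        self.in_item = False


def extract_list_items(html):
    """Return the text of all list items in the given HTML string."""
    parser = ListItemExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

For a page fragment such as "<ul><li>Fast search</li><li>Plugins</li></ul>", the function would return the two item texts and ignore any text outside the list.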

The second method, keyword-based parsing, searched for specific domain-appropriate keywords. For example, when the desired information is the features of a software product, the keyword might be “features”. The program could then pull out the surrounding web page material wherever it encountered an instance of the keyword. The third method was a hybrid of the previous two: it searched for a familiar structure, then tried to find a keyword within a certain distance of that structure.
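The keyword-based and hybrid methods might be sketched along the following lines. This is a minimal illustration under stated assumptions, not the implementation evaluated in the study. All names (keyword_contexts, hybrid_lists, window, max_distance) are hypothetical. The sketch assumes that "surrounding material" means a fixed number of characters of visible text on either side of the keyword, that the familiar structure is an HTML list, and that "distance" is counted in characters of visible text immediately before the list.

```python
import re
from html.parser import HTMLParser


class _TextAndLists(HTMLParser):
    """Records visible text and the span each top-level list occupies in it."""

    def __init__(self):
        super().__init__()
        self.text = ""    # visible text so far; each chunk is followed by one space
        self.depth = 0    # nesting depth of <ul>/<ol>
        self.start = 0    # offset in self.text where the current top-level list began
        self.lists = []   # (start, end) offsets of each top-level list in self.text

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            if self.depth == 0:
                self.start = len(self.text)
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.lists.append((self.start, len(self.text)))

    def handle_data(self, data):
        chunk = " ".join(data.split())
        if chunk:
            self.text += chunk + " "


def _parse(html):
    parser = _TextAndLists()
    parser.feed(html)
    parser.close()
    return parser.text, parser.lists


def keyword_contexts(html, keyword, window=40):
    """Keyword-based: return the text around each case-insensitive keyword hit."""
    text, _ = _parse(html)
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [text[max(0, m.start() - window):m.end() + window].strip()
            for m in pattern.finditer(text)]


def hybrid_lists(html, keyword, max_distance=60):
    """Hybrid: keep a list only if the keyword appears shortly before it."""
    text, lists = _parse(html)
    hits = []
    for start, end in lists:
        nearby = text[max(0, start - max_distance):start]
        if keyword.lower() in nearby.lower():
            hits.append(text[start:end].strip())
    return hits
```

In this sketch the hybrid method discards lists that have no nearby keyword, which is one plausible way of combining the structural and keyword signals. Other combinations, such as also looking for the keyword after the structure, are equally possible.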

The accuracy of these three methods will be established by comparing the program's output with the correct information found by hand. Statistical analysis will then be performed to determine which methods are most effective. It is hypothesized that the hybrid method will be more successful than either of the other two. We will present the results of these three methods for web pages that describe open-source projects. The resulting algorithms can be used to automatically collect information about these open-source projects from their websites.
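The abstract does not state which accuracy measure is used. As one possible way to compare a method's output with hand-collected answers, the sketch below computes precision and recall, two standard information-retrieval measures. The function name precision_recall and the case-insensitive exact matching of items are assumptions made for illustration.

```python
def precision_recall(extracted, expected):
    """Compare extracted items with hand-collected ones (hypothetical helper).

    Items are matched exactly after trimming whitespace and lowercasing.
    Returns (precision, recall) as floats between 0.0 and 1.0.
    """
    extracted_set = {item.strip().lower() for item in extracted}
    expected_set = {item.strip().lower() for item in expected}
    correct = extracted_set & expected_set

    # Precision: share of extracted items that are correct.
    precision = len(correct) / len(extracted_set) if extracted_set else 0.0
    # Recall: share of correct items that were extracted.
    recall = len(correct) / len(expected_set) if expected_set else 0.0
    return precision, recall
```

Computing these two values per page for each of the three methods would give paired samples that a later statistical comparison could use.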