A comprehensive web crawler that explores websites, collects statistics, and analyzes URL patterns
The goal was to explore a real website, collect statistics, and understand how crawlers handle URL discovery, page parsing, and word frequency analysis.
Note: The raw crawl data (collected from specific servers) is not included here, since publishing it could violate those sites' usage policies.
Visits and records every unique page (URL) across the target website
Identifies the longest page by word count for content analysis
Finds the 50 most frequent words across all crawled pages
Tracks how many pages exist under each subdomain
Unique pages visited: 10,175
Longest page:
URL: http://www.example.com/~cs224
Word count: 104,961
accessibility.example.com (5)
archive.example.com (183)
asterix.example.com (5)
cecs.example.com (10)
cert.example.com (8)
...and more
I know some classes require students to create web crawlers. This project represents my own independent implementation and is shared here for demonstration only. Please do not submit this code as your own work for any course.
The crawler discovers URLs by parsing each page's HTML and extracting its links. It maintains a queue (frontier) of URLs to visit and a set of already-visited URLs to avoid processing duplicates.
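A minimal sketch of that frontier loop, assuming requests and BeautifulSoup are used for fetching and link extraction (function and variable names here are illustrative, not taken from the actual code):

```python
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=100):
    """Breadth-first crawl: pop a URL from the frontier, fetch it,
    extract its links, and enqueue any URL not seen before."""
    frontier = deque([seed_url])
    visited = set()

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable or timed-out pages

        soup = BeautifulSoup(resp.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            # Resolve relative links and drop fragments (#section)
            link, _ = urldefrag(urljoin(url, anchor["href"]))
            if link not in visited:
                frontier.append(link)

    return visited
```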
Each page's text is tokenized and analyzed for word count. The system tracks the longest page seen so far and maintains frequency counts for all words encountered across the entire crawl.
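A sketch of how this analysis could work, assuming a simple regex tokenizer and Python's Counter (the helpers below are hypothetical, not the project's actual functions):

```python
import re
from collections import Counter

word_frequencies = Counter()   # running counts across the whole crawl
longest_page = ("", 0)         # (url, word_count) of the longest page so far

def analyze_page(url, text):
    """Tokenize the page text, update the global word frequencies,
    and remember the page with the highest word count."""
    global longest_page
    words = re.findall(r"[a-zA-Z0-9']+", text.lower())
    word_frequencies.update(words)
    if len(words) > longest_page[1]:
        longest_page = (url, len(words))

def top_words(n=50):
    """Return the n most frequent words seen so far."""
    return word_frequencies.most_common(n)
```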
The crawler also categorizes pages by subdomain, providing insight into how the target website is structured and organized.
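One way to track this, assuming urlparse is used to pull the hostname from each visited URL (illustrative only, with example.com standing in for the target domain as in the results above):

```python
from collections import defaultdict
from urllib.parse import urlparse

subdomain_counts = defaultdict(int)

def record_subdomain(url, target_domain="example.com"):
    """Increment the page count for the subdomain hosting this URL."""
    host = urlparse(url).netloc.lower()
    if host.endswith(target_domain):
        subdomain_counts[host] += 1

def print_subdomains():
    """Print the per-subdomain summary in the format shown above."""
    for host in sorted(subdomain_counts):
        print(f"{host} ({subdomain_counts[host]})")
```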