Web Crawler Project

A comprehensive web crawler that explores websites, collects statistics, and analyzes URL patterns

Python · Web Scraping · Data Analysis · URL Parsing

About

The goal was to explore a real website, collect statistics, and understand how crawlers handle URL discovery, page parsing, and word frequency analysis.

Note: The raw crawl data is not included here, since republishing content collected from specific servers could violate those sites' usage policies.

What It Does

URL Discovery

Visits and records every unique page URL across the target website

Page Analysis

Identifies the longest page by word count for content analysis

Word Frequency

Finds the 50 most frequent words across all crawled pages

Subdomain Tracking

Tracks how many pages exist under each subdomain

What The Report Looks Like

Results

Unique pages visited: 10,175

Longest page:

URL: http://www.example.com/~cs224

Word count: 104,961

Top 50 Most Frequent Words

events (27623)
search (13679)
isg (10738)
april (9989)
day (9925)
research (9696)
talks (8015)
reunion (8004)
outlook (7991)
next (7562)
data (7362)
us (6338)
news (6290)
read (5884)
received (5786)
upcoming (5639)
navigation (5602)
talk (5527)
views (5495)
file (5484)
calendar (5437)
enter (5321)
export (5286)
information (5264)
systems (5019)
may (4989)
jump (4810)
scheduled (4714)
contact (3894)
database (3853)
students (3741)
faculty (3648)
award (3538)
projects (3516)
group (3497)
people (3487)
best (3272)
two (3219)
courses (3182)
home (3150)
ieee (3092)
welcome (3087)
like (3024)
publications (3007)
fellow (2947)
recent (2941)

Subdomains

accessibility.example.com (5)

archive.example.com (183)

asterix.example.com (5)

cecs.example.com (10)

cert.example.com (8)

...and more

Disclaimer

I know some classes require students to create web crawlers. This project represents my own independent implementation and is shared here for demonstration only. Please do not submit this code as your own work for any course.

Technical Implementation

URL Discovery & Parsing

The crawler implements intelligent URL discovery by parsing HTML content and extracting links. It maintains a queue of URLs to visit and a set of visited URLs to avoid duplicates.
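The queue-plus-visited-set loop described above can be sketched as follows. This is an illustrative stand-in, not the project's actual code: the network fetch is replaced by an in-memory `{url: html}` map, and names like `crawl_order` and `extract_links` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    """Return absolute, fragment-free URLs found in the page's HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links and drop #fragments so each page counts once.
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]

def crawl_order(pages):
    """BFS over an in-memory {url: html} map; returns the visit order."""
    start = next(iter(pages))
    frontier = deque([start])   # queue of URLs to visit
    visited = set()             # avoids re-crawling duplicates
    order = []
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in extract_links(url, pages.get(url, "")):
            if link not in visited and link in pages:
                frontier.append(link)
    return order
```

Using a deque gives breadth-first traversal; swapping it for a stack would make the crawl depth-first without changing the duplicate-avoidance logic.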

Content Analysis

Each page is analyzed for word count and content. The system tracks the longest page and maintains frequency counts for all words encountered across the entire crawl.
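A minimal sketch of this analysis step, assuming pages have already been reduced to plain text (the `analyze` function and its simple alphabetic tokenizer are illustrative, not the project's actual implementation):

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-zA-Z]+")

def tokenize(text):
    """Lowercase alphabetic tokens; a simple stand-in for real text extraction."""
    return WORD_RE.findall(text.lower())

def analyze(pages):
    """pages: {url: text}. Returns (longest_url, max_words, word-frequency Counter)."""
    freq = Counter()
    longest_url, max_words = None, -1
    for url, text in pages.items():
        words = tokenize(text)
        freq.update(words)               # running counts across the whole crawl
        if len(words) > max_words:       # track the longest page by word count
            longest_url, max_words = url, len(words)
    return longest_url, max_words, freq
```

The top-50 list in the report would then fall out of `freq.most_common(50)`.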

Subdomain Tracking

The crawler intelligently categorizes pages by subdomain, providing insights into the structure and organization of the target website.
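Grouping by subdomain can be done with `urllib.parse` on the hostname. A sketch under the assumption that only hosts under one root domain are counted (the function name and the `root_domain` parameter are illustrative):

```python
from collections import defaultdict
from urllib.parse import urlparse

def count_subdomains(urls, root_domain="example.com"):
    """Count unique page URLs per subdomain of root_domain."""
    pages_by_host = defaultdict(set)
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # Keep the root domain itself and any of its subdomains.
        if host == root_domain or host.endswith("." + root_domain):
            pages_by_host[host].add(url)
    return {host: len(pages_by_host[host]) for host in sorted(pages_by_host)}
```

Collecting URLs into sets before counting means duplicate hits on the same page don't inflate a subdomain's total, matching the "unique pages" figures in the report.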