What is a web crawler?
An Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.
A web crawler (also known as a spider or spiderbot) starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit next. This is how search engines 'discover' and keep track of the billions of pages on the web.
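The seed-and-frontier loop described above can be sketched as a breadth-first traversal. The snippet below is a minimal illustration: it uses a made-up, in-memory link graph in place of real HTTP fetching and HTML link extraction, so the URLs and the `LINK_GRAPH` structure are assumptions for demonstration only.

```python
from collections import deque

# A tiny in-memory "web": page URL -> list of outgoing links.
# These URLs are hypothetical; a real crawler would fetch pages over HTTP
# and extract <a href="..."> links from the HTML.
LINK_GRAPH = {
    "https://a.example": ["https://b.example", "https://c.example"],
    "https://b.example": ["https://c.example"],
    "https://c.example": ["https://a.example"],  # cycle back to the seed
}

def crawl(seeds):
    """Breadth-first crawl: visit each URL once, queue newly found links."""
    visited = set()
    frontier = deque(seeds)  # URLs waiting to be visited
    order = []               # visit order, for inspection
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # Identify the hyperlinks "in the page" and add new ones to the frontier.
        for link in LINK_GRAPH.get(url, []):
            if link not in visited:
                frontier.append(link)
    return order

print(crawl(["https://a.example"]))
```

Note that the `visited` set is what keeps the crawler from looping forever on the cycle between the three pages; real crawlers do the same with URL deduplication at much larger scale.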
graph LR
Center["What is a web crawler?"]:::main
Rel_search_engine["search-engine"]:::related -.-> Center
click Rel_search_engine "/terms/search-engine"
Rel_keyword_research["keyword-research"]:::related -.-> Center
click Rel_keyword_research "/terms/keyword-research"
Rel_sorting_algorithm["sorting-algorithm"]:::related -.-> Center
click Rel_sorting_algorithm "/terms/sorting-algorithm"
classDef main fill:#7c3aed,stroke:#8b5cf6,stroke-width:2px,color:white,font-weight:bold,rx:5,ry:5;
classDef pre fill:#0f172a,stroke:#3b82f6,color:#94a3b8,rx:5,ry:5;
classDef child fill:#0f172a,stroke:#10b981,color:#94a3b8,rx:5,ry:5;
classDef related fill:#0f172a,stroke:#8b5cf6,stroke-dasharray: 5 5,color:#94a3b8,rx:5,ry:5;
linkStyle default stroke:#4b5563,stroke-width:2px;
🧠 Knowledge Check
🧒 Explain It Like I'm 5
A web crawler is like a tiny robotic explorer that travels from one website to another using links like [bridges](/es/terms/bridges). Every time it finds a new [bridge](/es/terms/bridge) (a link), it crosses it and writes down what it saw on the other side. Thousands of these robots are constantly moving across the web day and night.
🤓 Expert Deep Dive
Well-behaved crawlers follow the Robots Exclusion Standard (robots.txt), which tells them which parts of a site they are allowed to visit. Key challenges for crawlers include 'spider traps' (infinite loops of links) and JavaScript-heavy sites, where pages must be rendered before their links can be extracted.
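Python's standard library includes a robots.txt parser, so the rules above can be checked before fetching a URL. A small sketch, assuming a hypothetical robots.txt file and made-up user-agent names (the file is parsed from a string here, so no network access is needed):

```python
from urllib import robotparser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A generic crawler falls under the "*" group: /private/ is off-limits.
print(rp.can_fetch("MyCrawler", "https://example.com/public/page.html"))
print(rp.can_fetch("MyCrawler", "https://example.com/private/data.html"))
# "BadBot" has its own group disallowing the whole site.
print(rp.can_fetch("BadBot", "https://example.com/anything"))
```

In production, `rp.set_url(...)` plus `rp.read()` would fetch the live robots.txt instead; the string-based `parse()` call is used here only to keep the sketch self-contained.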