• Home
  • Become a Hacker
    • Get Started
    • Hacker Mindset
    • Roadmap
    • Simple Setup – Hacker 101
    • Types of Hackers
    • Recommended Courses
  • Boot People Offline
  • Courses
    • All Hacking Courses
    • Cyber Security School
  • CTF
    • Beginners to Advanced Guide
    • Create your own CTF box
    • Field and Resources Guide
    • Platforms & Wargames
    • Tools Used for Solving CTF
    • Writeups
  • Dark Web
    • Beginners Guide
    • Darknet Markets
    • Darkweb 101 (Anonymity Guide)
    • Dark Web OSINT Tools
    • Hacking Forums
    • Latest News
    • Onion Links
  • Hacker Gadgets
  • Hacking Books
  • Tools Directory
Menu
  • Home
  • Become a Hacker
    • Get Started
    • Hacker Mindset
    • Roadmap
    • Simple Setup – Hacker 101
    • Types of Hackers
    • Recommended Courses
  • Boot People Offline
  • Courses
    • All Hacking Courses
    • Cyber Security School
  • CTF
    • Beginners to Advanced Guide
    • Create your own CTF box
    • Field and Resources Guide
    • Platforms & Wargames
    • Tools Used for Solving CTF
    • Writeups
  • Dark Web
    • Beginners Guide
    • Darknet Markets
    • Darkweb 101 (Anonymity Guide)
    • Dark Web OSINT Tools
    • Hacking Forums
    • Latest News
    • Onion Links
  • Hacker Gadgets
  • Hacking Books
  • Tools Directory
Search
Close
  • Home
  • 2015
  • August
  • 14
  • Building your own Web Crawler

Building your own Web Crawler

August 14, 2015November 19, 2017 Comments Off on Building your own Web Crawler
Building your own Web Crawler how to build a web crawler python coding python hax python web crawler search engine hacking web crawler web crawler source code

Web crawler is a program that browses the Internet (World Wide Web) in a predetermined, configurable and automated manner and performs given action on crawled content. Search engines like Google and Yahoo use spidering as a means of providing up-to-date data.

Webhose.io, a company which provides direct access to live data from hundreds of thousands of forums, news and blogs, on Aug 12, 2015, posted the articles describing a tiny, multi-threaded web crawler written in python. This python web crawler is capable of crawling the entire web for you. Ran Geva, the author of this tiny python web crawler says that:

I wrote as “Dirty”, “Iffy”, “Bad”, “Not very good”. I say, it gets the job done and downloads thousands of pages from multiple pages in a matter of hours. No setup is required, no external imports, just run the following python code with a seed site and sit back (or go do something else because it could take a few hours, or days depending on how much data you need).

The python based multi-threaded crawler is pretty simple and very fast. It is capable of detecting and eliminating duplicate links and saving both source and link which can later be used in finding inbound and outbound links for calculating page rank. It is completely free and the code is listed below:

 

import sys, thread, Queue, re, urllib, urlparse, time, os, sys
dupcheck = set()
q = Queue.Queue(100)
q.put(sys.argv[1])
def queueURLs(html, origLink):
for url in re.findall(”'<a[^>]+href=[“‘](.[^”‘]+)[“‘]”’, html, re.I):
link = url.split(“#”, 1)[0] if url.startswith(“http”) else ‘{uri.scheme}://{uri.netloc}’.format(uri=urlparse.urlparse(origLink)) + url.split(“#”, 1)[0]
if link in dupcheck:
continue
dupcheck.add(link)
if len(dupcheck) > 99999:
dupcheck.clear()
q.put(link)
def getHTML(link):
try:
html = urllib.urlopen(link).read()
open(str(time.time()) + “.html”, “w”).write(“” % link + “n” + html)
queueURLs(html, link)
except (KeyboardInterrupt, SystemExit):
raise
except Exception:
pass
while True:
thread.start_new_thread( getHTML, (q.get(),))
time.sleep
DOWNLOAD CODE HERE

Save the above code with some name lets say “myPythonCrawler.py”. To start crawling any website just type:

$ python myPythonCrawler.py
1
$ python myPythonCrawler.py http://haxf4rall.com

Sit back and enjoy this web crawler in python. It will download the entire site for you.

Do you like this dead simple python based multi-threaded web crawler? Let us know in comments.

Post navigation

Hacking WiFi – Selecting the best strategy
Windows 10 – Most Important Features you need to know

Related Articles

PHPStan – PHP Static Analysis Tool

- Secure Coding
July 21, 2019July 21, 2019

ShC – Shell Script Compiler

- Secure Coding
July 8, 2019

PassPie – Multiplatform Command-Line Password Manager

- Tricks & How To's
July 8, 2019
hacker gadgets
hacker phone covers

Recent Posts

Cortex-XDR-Config-Extractor - Cortex XDR Config Extractor

Cortex-XDR-Config-Extractor – Cortex XDR Config Extractor

March 20, 2023
NimPlant - A Light-Weight First-Stage C2 Implant Written In Nim

NimPlant – A Light-Weight First-Stage C2 Implant Written In Nim

March 20, 2023
X-force - IBM Security Utilitary Library In Python. Search And Query All Sources: Threat_Activities And Groups, Malware_Analysis, Industries

X-force – IBM Security Utilitary Library In Python. Search And Query All Sources: Threat_Activities And Groups, Malware_Analysis, Industries

March 20, 2023
Thunderstorm - Modular Framework To Exploit UPS Devices

Thunderstorm – Modular Framework To Exploit UPS Devices

March 20, 2023
DataSurgeon - Quickly Extracts IP's, Email Addresses, Hashes, Files, Credit Cards, Social Secuirty Numbers And More From Text

DataSurgeon – Quickly Extracts IP’s, Email Addresses, Hashes, Files, Credit Cards, Social Secuirty Numbers And More From Text

March 19, 2023
FindUncommonShares - A Python Equivalent Of PowerView's Invoke-ShareFinder.ps1 Allowing To Quickly Find Uncommon Shares In Vast Windows Domains

FindUncommonShares – A Python Equivalent Of PowerView’s Invoke-ShareFinder.ps1 Allowing To Quickly Find Uncommon Shares In Vast Windows Domains

March 19, 2023

Social Media Hacking

SocialPath – Track users across Social Media Platforms

SocialPath – Track users across Social Media Platforms

- Social Media Hacking
October 16, 2019October 16, 2019

SocialPath is a django application for gathering social media intelligence on specific username. It checks for Twitter, Instagram, Facebook, Reddit...

SocialScan – Check Email Address and Username Availability on Online Platforms

SocialScan – Check Email Address and Username Availability on Online Platforms

June 17, 2019
Shellphish – Phishing Tool For 18 Social Media Apps

Shellphish – Phishing Tool For 18 Social Media Apps

June 10, 2019July 27, 2019
WhatsApp Hacking using QRLJacking

WhatsApp Hacking using QRLJacking

May 2, 2019May 19, 2019
How to Hack any Facebook Account with Z-Shadow

How to Hack any Facebook Account with Z-Shadow

April 26, 2019June 29, 2020
hacker buffs

About Us

Haxf4rall is a collective, a good starting point and provides a variety of quality material for cyber security professionals.

Join Our Community!

Please wait...
Get the latest News and Hacking Tools delivered to your inbox.
Don't Worry ! You will not be spammed

Active Members

Submit a Tool

Hackers Handbook 2018


Grab your copy here

ABOUT US

Haxf4rall is a collective, a good starting point and provides a variety of quality material for cyber security professionals.

Our primary focus revolves around the latest tools released in the Infosec community and provide a platform for developers to showcase their skillset and current projects.

COMPANY
  • Contact Us
  • Disclaimer
  • Hacker Gadgets
  • LANC Remastered
  • PCPS IP Puller
  • Privacy Policy
  • Sitemap
  • Submit your Tool
Menu
  • Contact Us
  • Disclaimer
  • Hacker Gadgets
  • LANC Remastered
  • PCPS IP Puller
  • Privacy Policy
  • Sitemap
  • Submit your Tool
Live Chat
RESOURCES
  • Attack Process
  • Become a Hacker
  • Career Pathways
  • Dark Web
  • Hacking Books
  • Practice Your Skills
  • Recommended Courses
  • Simple Setup – Hacker 101
Menu
  • Attack Process
  • Become a Hacker
  • Career Pathways
  • Dark Web
  • Hacking Books
  • Practice Your Skills
  • Recommended Courses
  • Simple Setup – Hacker 101
Get Started
TOOLBOX
  • Anonymity
  • Bruteforce
  • DoS – Denial of Service
  • Information Gathering
  • Phishing
  • SQL Injection
  • Vulnerability Scanners
  • Wifi Hacking
Menu
  • Anonymity
  • Bruteforce
  • DoS – Denial of Service
  • Information Gathering
  • Phishing
  • SQL Injection
  • Vulnerability Scanners
  • Wifi Hacking
Tools Directory

2014 – 2020 | Haxf4rall.com               Stay Connected:

Facebook Twitter Google-plus Wordpress
Please wait...

Join Our Community

Subscribe now and get your free HACKERS HANDBOOK

Don't Worry ! You will not be spammed
SIGN UP FOR NEWSLETTER NOW