A Common Interesting Problem – Getting All the URLs from a Domain
There's a small but persistent problem that affects web designers, web developers, and even business owners who are involved in their own digital marketing.
The problem is getting a list of URLs from a website.
Seems simple enough, but it isn't. There are a dozen types of URLs you may want to catalog: regular web pages, images, and media files. Then there's the distinction between public and private URLs.
The easiest way to get a list is to pull the sitemap.xml file. But not every website has one, and it only includes whatever the content engine or the website administrator has decided to add to it.
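When a site does have one, pulling and parsing the sitemap is straightforward. Here's a minimal Python sketch using requests and the standard library's XML parser; the example.com address is just a placeholder, and the recursion only covers the simple case of a sitemap index that points to child sitemaps.

```python
import requests
import xml.etree.ElementTree as ET

# Placeholder domain; swap in the site you want to inventory.
SITEMAP_URL = "https://example.com/sitemap.xml"

def sitemap_urls(sitemap_url):
    """Return the <loc> entries of a sitemap, recursing into sitemap indexes."""
    response = requests.get(sitemap_url, timeout=10)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    ns = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    locs = [loc.text.strip() for loc in root.iter(ns + "loc") if loc.text]

    # A sitemap index lists other sitemaps rather than pages; follow each one.
    if root.tag == ns + "sitemapindex":
        urls = []
        for child_sitemap in locs:
            urls.extend(sitemap_urls(child_sitemap))
        return urls
    return locs

if __name__ == "__main__":
    for url in sitemap_urls(SITEMAP_URL):
        print(url)
```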
You could also ask Google which pages it knows about on a domain (a site:example.com search), but that list is based on index status, so if a particular page isn't indexed, it won't show up there.
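If you want to script that approach instead of eyeballing search results, one option is Google's Custom Search JSON API with a site: query. This is only a sketch under some assumptions: you'd need your own API key and Programmable Search Engine ID (the placeholders below), and the API returns at most a handful of result pages per query, so it inherits the same indexing gaps.

```python
import requests

# Placeholders: these come from your own Google API console setup.
API_KEY = "YOUR_API_KEY"
CX = "YOUR_SEARCH_ENGINE_ID"

def indexed_urls(domain, max_pages=5):
    """Ask the Custom Search JSON API which pages Google has indexed on a domain."""
    urls = []
    for page in range(max_pages):
        params = {
            "key": API_KEY,
            "cx": CX,
            "q": f"site:{domain}",
            "start": page * 10 + 1,  # the API serves at most 10 results per request
        }
        data = requests.get(
            "https://www.googleapis.com/customsearch/v1", params=params, timeout=10
        ).json()
        items = data.get("items", [])
        if not items:
            break
        urls.extend(item["link"] for item in items)
    return urls

if __name__ == "__main__":
    for url in indexed_urls("example.com"):
        print(url)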
You could use a custom wget script; there are plenty of those on GitHub and scattered across other people's websites. But they never seem to work quite right, they don't do what's needed, or they do everything except the one thing you need.
Just a simple list of all the URLs on a domain.
Ideally, the solution would have the option to show both public-facing and private URLs, and it would be nice to filter by the type of content each URL points to: text/HTML, media, CSS, et cetera.
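That wish list is roughly what a small crawler would do. Here's a simplified Python sketch of the idea: a same-domain crawl that records each URL along with its Content-Type so the list can be filtered afterward. It's an illustration, not a production crawler; example.com is a placeholder, and a real version would respect robots.txt, throttle its requests, and probably use HEAD requests for non-HTML assets.

```python
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

# Placeholder start page; the crawl never leaves this domain.
START_URL = "https://example.com/"

def crawl(start_url, max_pages=200):
    """Breadth-first crawl of one domain, returning {url: content_type}."""
    domain = urlparse(start_url).netloc
    seen = {}              # url -> Content-Type header
    queue = [start_url]

    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        content_type = response.headers.get("Content-Type", "unknown").split(";")[0]
        seen[url] = content_type

        # Only HTML pages are parsed for further links.
        if content_type != "text/html":
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        for tag in soup.find_all(["a", "img", "link", "script"]):
            href = tag.get("href") or tag.get("src")
            if not href:
                continue
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == domain and absolute not in seen:
                queue.append(absolute)
    return seen

if __name__ == "__main__":
    for url, content_type in crawl(START_URL).items():
        print(f"{content_type}\t{url}")
```

With the content type stored next to each URL, filtering down to just pages, just images, or just stylesheets is a one-line condition on the result.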
The closest solution I've found for this is a little app called Screaming Frog. It does everything needed to get a list of a website's URLs, or at least it does a good job of finding all the links that are public-facing and reachable. The free version has a crawl limit, so if you do this often or on a large website, you may need the paid version.