cURL tutorial – find remote broken links

Find remote broken links with cURL
Photo by israel palacio on Unsplash

In this tutorial I will show you how to find broken links on remote servers. If you work with product feeds from affiliate networks, some campaigns may have stopped, and some products may still appear in the feed even though their links lead to a 404 (not found) page.

Using cURL you can loop through all URLs, find links that are broken or that lead to a particular URL (for stopped campaigns), and delete those products from your database.

If your hosting provider supports cron jobs, you can set up a wp_cron job that performs the check and removes all broken links automatically on a daily basis.

Using cURL, we will loop through all URLs and read the last effective URL of each one (CURLINFO_EFFECTIVE_URL). If it leads to a 404 (not found) page or to a particular URL, we will remove that product from the database.
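Before handling many URLs at once, it may help to see the idea for a single URL. The following is only a minimal sketch: the helper name inspect_url() and the shorter timeouts are my own assumptions and are not part of the function used later in this tutorial.

```php
<?php
/**
 * Follow the redirects of one URL and report where it ends up.
 * inspect_url() is a hypothetical helper name used for illustration only.
 */
function inspect_url( string $url ): array {
	$ch = curl_init( $url );
	curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true ); // do not echo the response
	curl_setopt( $ch, CURLOPT_NOBODY, true );         // HEAD request: headers only
	curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true ); // follow redirects
	curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, 5 );
	curl_setopt( $ch, CURLOPT_TIMEOUT, 10 );
	curl_exec( $ch );

	return array(
		// URL of the last request after all redirects.
		'effective_url' => curl_getinfo( $ch, CURLINFO_EFFECTIVE_URL ),
		// Last HTTP status code; 0 when no response was received.
		'http_code'     => curl_getinfo( $ch, CURLINFO_HTTP_CODE ),
	);
}
```

Calling inspect_url() with one of your affiliate URLs returns an array with the two values that the rest of this tutorial relies on.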

In this example I will work with TradeTracker product feeds. If a campaign has stopped, the user is redirected to the following URL: https://static.tradetracker.net/int/international/jump.html

See the function below. It should be placed in your functions.php or in a plugin. You can trigger the broken link search via AJAX at the click of a button, or let a wp_cron job run it for you automatically.

function productfeed_trash_broken_links() {
	$urls_arr     = array();
	$broken_links = array();

	// Collect the affiliate URL of every product post.
	$all_product_ids = get_posts( array( 'post_type' => 'pa_products', 'posts_per_page' => -1, 'fields' => 'ids' ) );
	foreach ( $all_product_ids as $product_id ) {
		$pa_product_url = get_post_meta( $product_id, 'PA_products_affiliate_url', true );
		// Skip products that have no affiliate URL stored.
		if ( ! empty( $pa_product_url ) ) {
			array_push( $urls_arr, $pa_product_url );
		}
	}

	// Process the URLs in batches of 200 to keep memory usage under control.
	$batch_of = 200;
	$batch    = array_chunk( $urls_arr, $batch_of );
	foreach ( $batch as $chunk ) {

		$nodes      = $chunk;
		$mh         = curl_multi_init();
		$curl_array = array();
		foreach ( $nodes as $i => $url ) {
			$curl_array[ $i ] = curl_init( $url );
			curl_setopt( $curl_array[ $i ], CURLOPT_RETURNTRANSFER, true );
			// Certificate verification is skipped: we only inspect status codes and redirect targets.
			curl_setopt( $curl_array[ $i ], CURLOPT_SSL_VERIFYPEER, false );
			curl_setopt( $curl_array[ $i ], CURLOPT_SSL_VERIFYHOST, false );
			curl_setopt( $curl_array[ $i ], CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13' );
			// Request the headers only, without the response body.
			curl_setopt( $curl_array[ $i ], CURLOPT_HEADER, true );
			curl_setopt( $curl_array[ $i ], CURLOPT_NOBODY, true );
			// Follow redirects so that we can read the last effective URL.
			curl_setopt( $curl_array[ $i ], CURLOPT_FOLLOWLOCATION, true );
			curl_setopt( $curl_array[ $i ], CURLOPT_CONNECTTIMEOUT, 30 );
			curl_setopt( $curl_array[ $i ], CURLOPT_TIMEOUT, 30 );
			curl_multi_add_handle( $mh, $curl_array[ $i ] );
		}

		// Run all handles of this batch until every transfer has finished.
		$running = null;
		do {
			usleep( 1000 );
			$status = curl_multi_exec( $mh, $running );
		} while ( $running > 0 && $status == CURLM_OK );

		foreach ( $nodes as $i => $url ) {
			$redirect_url = curl_getinfo( $curl_array[ $i ], CURLINFO_EFFECTIVE_URL );
			$httpCode     = curl_getinfo( $curl_array[ $i ], CURLINFO_HTTP_CODE );

			if ( ! empty( $redirect_url ) ) {
				$redirect_url_path = parse_url( $redirect_url, PHP_URL_PATH );
				// The URL leads to the stopped-campaign page.
				if ( $redirect_url_path === '/int/international/jump.html' ) {
					array_push( $broken_links, $url );
				}
			}

			// The URL is a 404 (not found).
			if ( $httpCode === 404 ) {
				array_push( $broken_links, $url );
			}
		}

		foreach ( $nodes as $i => $url ) {
			curl_multi_remove_handle( $mh, $curl_array[ $i ] );
		}
		curl_multi_close( $mh );

		// Pause between batches so that remote servers are not hit in back-to-back bursts.
		sleep( 5 );
	}

	// Trash every product whose affiliate URL was found to be broken.
	foreach ( $all_product_ids as $product_id ) {
		$pa_product_url = get_post_meta( $product_id, 'PA_products_affiliate_url', true );
		if ( in_array( $pa_product_url, $broken_links, true ) ) {
			wp_trash_post( $product_id );
		}
	}

	// End the request when called via AJAX; a wp_cron run must not be terminated here.
	if ( wp_doing_ajax() ) {
		wp_die();
	}
}
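To run the function above on a schedule or on a button click, it has to be attached to a hook. The sketch below shows one way this could look. The event name productfeed_daily_link_check and the AJAX action productfeed_check_links are names I made up for this example, and the function_exists() guard is only there so the snippet can be loaded outside WordPress.

```php
<?php
/**
 * Register a daily wp_cron event, but only if it is not scheduled yet.
 * The event name is a hypothetical example.
 */
function productfeed_schedule_link_check() {
	if ( ! wp_next_scheduled( 'productfeed_daily_link_check' ) ) {
		wp_schedule_event( time(), 'daily', 'productfeed_daily_link_check' );
	}
}

// Inside WordPress add_action() always exists; the guard is for standalone loading only.
if ( function_exists( 'add_action' ) ) {
	add_action( 'init', 'productfeed_schedule_link_check' );

	// Run the link check whenever the daily event fires.
	add_action( 'productfeed_daily_link_check', 'productfeed_trash_broken_links' );

	// Run the link check when a logged-in user sends an AJAX request
	// with action=productfeed_check_links (hypothetical action name).
	add_action( 'wp_ajax_productfeed_check_links', 'productfeed_trash_broken_links' );
}
```

Keep in mind that wp_cron events are triggered by visits to the site, so on a site with little traffic the daily check may run later than scheduled.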

First we get all affiliate URLs and push them into an array so that we can process them in batches. A second array collects the broken links.

We loop through all post IDs, read the affiliate URL from the post meta, and push every URL into the $urls_arr array.

Next we split $urls_arr into batches of 200 to stay within the memory limit. For each chunk we call curl_multi_init(), which allows us to process multiple cURL handles asynchronously.
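To make the batching step concrete, here is a small standalone sketch of array_chunk(). The example.com URLs are placeholders, and the batch size of 2 is chosen only to keep the output short.

```php
<?php
// Five placeholder URLs, for illustration only.
$urls_arr = array(
	'https://example.com/a',
	'https://example.com/b',
	'https://example.com/c',
	'https://example.com/d',
	'https://example.com/e',
);

// Split the list into batches of 2; the last batch holds the remainder.
$batches = array_chunk( $urls_arr, 2 );

foreach ( $batches as $n => $chunk ) {
	printf( "Batch %d: %d URL(s)\n", $n + 1, count( $chunk ) );
}
// Batch 1: 2 URL(s)
// Batch 2: 2 URL(s)
// Batch 3: 1 URL(s)
```

By default array_chunk() reindexes every chunk from 0, which is why the same index can be used for the URL and its cURL handle inside each batch.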

Then we loop through all nodes, set the options with curl_setopt(), and call curl_multi_exec() to process the handles in the stack. CURLINFO_EFFECTIVE_URL gives us the last effective URL, after all redirects.
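The same step can also be written as a small reusable function. Note that this sketch is a variation and not the code from the function above: it waits with curl_multi_select() instead of a fixed usleep(), and the name check_urls_parallel() as well as the shape of the returned array are my own assumptions.

```php
<?php
/**
 * Check a batch of URLs in parallel.
 * Returns array( url => array( 'effective_url' => string, 'http_code' => int ) ).
 * check_urls_parallel() is a hypothetical helper name.
 */
function check_urls_parallel( array $urls ): array {
	$mh      = curl_multi_init();
	$handles = array();

	foreach ( $urls as $i => $url ) {
		$handles[ $i ] = curl_init( $url );
		curl_setopt_array( $handles[ $i ], array(
			CURLOPT_RETURNTRANSFER => true, // do not echo the response
			CURLOPT_NOBODY         => true, // headers only
			CURLOPT_FOLLOWLOCATION => true, // follow redirects
			CURLOPT_CONNECTTIMEOUT => 30,
			CURLOPT_TIMEOUT        => 30,
		) );
		curl_multi_add_handle( $mh, $handles[ $i ] );
	}

	// Drive all transfers until none is running anymore.
	$running = 0;
	do {
		$status = curl_multi_exec( $mh, $running );
		// Block until there is activity; back off briefly if select fails.
		if ( $running > 0 && curl_multi_select( $mh, 1.0 ) === -1 ) {
			usleep( 1000 );
		}
	} while ( $running > 0 && CURLM_OK === $status );

	$results = array();
	foreach ( $urls as $i => $url ) {
		$results[ $url ] = array(
			'effective_url' => curl_getinfo( $handles[ $i ], CURLINFO_EFFECTIVE_URL ),
			'http_code'     => curl_getinfo( $handles[ $i ], CURLINFO_HTTP_CODE ),
		);
		curl_multi_remove_handle( $mh, $handles[ $i ] );
	}
	curl_multi_close( $mh );

	return $results;
}
```

Passing one chunk of URLs to this function gives back the effective URL and HTTP code for each of them, which is all the later checks need.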

After that we use parse_url() with PHP_URL_PATH to get the path of each effective URL. If the path equals “/int/international/jump.html”, the original URL is pushed into the $broken_links array.
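The path comparison can be tried on its own. In this sketch the function name is_stopped_campaign() is hypothetical, and the example.com URL is a placeholder.

```php
<?php
/**
 * Tell whether an effective URL points at the stopped-campaign page.
 * is_stopped_campaign() is a hypothetical helper name.
 */
function is_stopped_campaign( string $effective_url, string $stopped_path = '/int/international/jump.html' ): bool {
	// parse_url() returns null when the URL has no path and false for
	// seriously malformed URLs; the strict comparison rejects both.
	return parse_url( $effective_url, PHP_URL_PATH ) === $stopped_path;
}

var_dump( is_stopped_campaign( 'https://static.tradetracker.net/int/international/jump.html' ) ); // bool(true)
var_dump( is_stopped_campaign( 'https://example.com/product/123' ) );                              // bool(false)
```

Because only the path is compared, a query string appended to the redirect target does not affect the result.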

We also asked cURL for the returned HTTP code, using CURLINFO_HTTP_CODE. If the returned HTTP code equals 404, the URL is pushed into the $broken_links array as well.
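As a rough illustration of the status code check, the sketch below filters a list of results down to the 404s. The function name filter_not_found() and the sample data are made up for this example.

```php
<?php
/**
 * Return the URLs whose final HTTP status code is 404.
 * $results maps each URL to its status code; this shape is an assumption
 * made for the example.
 */
function filter_not_found( array $results ): array {
	$broken = array();
	foreach ( $results as $url => $http_code ) {
		// CURLINFO_HTTP_CODE is an integer, so a strict comparison works.
		if ( 404 === $http_code ) {
			$broken[] = $url;
		}
	}
	return $broken;
}

$sample = array(
	'https://example.com/ok'      => 200,
	'https://example.com/missing' => 404,
	'https://example.com/timeout' => 0, // cURL reports 0 when no response was received
);

print_r( filter_not_found( $sample ) );
```

A code of 0 is deliberately not treated as broken here, so a product is not removed just because a server was temporarily unreachable.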

Now that the $broken_links array holds all 404 (not found) pages and the URLs of stopped campaigns, we loop through all products/posts. If a product's affiliate URL exists in the array of broken links, the post is trashed using wp_trash_post().
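With many products, calling in_array() for every post scans the whole list of broken links each time. One possible alternative, shown here only as a sketch, is to flip the list once and use isset() for the lookup. The function name select_broken_ids() and the ID to URL map are assumptions for this example.

```php
<?php
/**
 * Return the IDs whose affiliate URL appears in the list of broken links.
 * select_broken_ids() and the $id_to_url shape are hypothetical.
 */
function select_broken_ids( array $id_to_url, array $broken_links ): array {
	// Flip once: URLs become keys, so each lookup is a hash check.
	$broken_lookup = array_flip( $broken_links );

	$ids = array();
	foreach ( $id_to_url as $id => $url ) {
		if ( isset( $broken_lookup[ $url ] ) ) {
			$ids[] = $id;
		}
	}
	return $ids;
}

$ids = select_broken_ids(
	array(
		10 => 'https://example.com/a',
		11 => 'https://example.com/b',
		12 => 'https://example.com/c',
	),
	array( 'https://example.com/b', 'https://example.com/b' ) // duplicates are harmless
);
print_r( $ids );
```

Each returned ID could then be passed to wp_trash_post().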

You can also use wp_delete_post() with the second parameter, $force_delete, set to true if you wish to delete those products permanently.
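If you want to switch between the two behaviours, a small wrapper could look like the sketch below. The name productfeed_remove_product() is hypothetical, and the two WordPress functions it calls are only available inside WordPress.

```php
<?php
/**
 * Trash a product, or delete it for good when $permanent is true.
 * productfeed_remove_product() is a hypothetical helper name.
 */
function productfeed_remove_product( int $product_id, bool $permanent = false ) {
	if ( $permanent ) {
		// $force_delete = true bypasses the trash.
		return wp_delete_post( $product_id, true );
	}
	// Default: move the post to the trash so it can still be restored.
	return wp_trash_post( $product_id );
}
```

Trashing is the safer default while you are still testing, because a product that was flagged by mistake can be restored.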
