{"id":3782,"date":"2024-04-07T12:43:28","date_gmt":"2024-04-07T17:43:28","guid":{"rendered":"https:\/\/osric.com\/chris\/accidental-developer\/?p=3782"},"modified":"2024-04-07T12:43:28","modified_gmt":"2024-04-07T17:43:28","slug":"extracting-links-from-google-sheets","status":"publish","type":"post","link":"https:\/\/osric.com\/chris\/accidental-developer\/2024\/04\/extracting-links-from-google-sheets\/","title":{"rendered":"Extracting links from Google Sheets"},"content":{"rendered":"<p>I was working with a shared Google Sheet at work and ran into this:<\/p>\n<figure id=\"attachment_3788\" aria-describedby=\"caption-attachment-3788\" style=\"width: 622px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/osric.com\/chris\/accidental-developer\/2024\/04\/extracting-links-from-google-sheets\/google_sheet_links\/\" rel=\"attachment wp-att-3788\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/osric.com\/chris\/accidental-developer\/wp-content\/uploads\/2024\/04\/google_sheet_links.png\" alt=\"An excerpt of a Google Sheet. Each row contains a cell with a hyperlink labeled Link\" width=\"622\" height=\"496\" class=\"size-full wp-image-3788\" \/><\/a><figcaption id=\"caption-attachment-3788\" class=\"wp-caption-text\">An excerpt of a Google Sheet. Each row contains a cell with a hyperlink labeled Link, but the actual URL is not displayed.<\/figcaption><\/figure>\n<p>I get it, URLs can be long and messy. We want narrow columns that look clean, not cluttered. But I wanted to analyze the URLs and search for certain content and patterns, which were hidden from me behind the link text.<\/p>\n<p>How can I extract all the URLs?<br \/>\n<!--more--><br \/>\n<strong>Brute force?<\/strong><br \/>\nI could get each link one at a time, by selecting each cell and using the link copy function in Google Sheets. But that would be extremely tedious and boring.<\/p>\n<p><strong>Can paste-with-formatting actually help?<\/strong><br \/>\nI tried copying-and-pasting the data into a variety of applications that I thought might be link-aware, but my results always ended up with just the link text and no URLs.<\/p>\n<p><strong>StackOverflow to the rescue?<\/strong><br \/>\nThe community provided plenty of custom formulae in <a href=\"https:\/\/stackoverflow.com\/questions\/35230764\/how-to-extract-url-from-link-in-google-sheets-using-a-formula\">How to extract URL from Link in Google Sheets using a formula?<\/a><\/p>\n<p>But I wasn&#8217;t interested in learning about creating a custom Google Sheets formula. Maybe that is something I should learn more about, but I just wanted a quick fix.<\/p>\n<p><strong>Can AI tools help?<\/strong><br \/>\nI admit, I logged into Microsoft Copilot to see if it could handle this. It confidently told me that it could, but then it failed every time with complaints about every file format I used to upload the data: Excel, HTML, PDF. (I was using test data, not actual work data, in case you are wondering how foolhardy or cautious I am about sharing data with LLM tools.)<\/p>\n<p><strong>A sed and grep pipeline solution<\/strong><\/p>\n<p>1. Copy just one target column to a new Google Sheet<br \/>\n2. Download the new Google Sheet as HTML<br \/>\n3. Open the resulting ZIP file and extract just the HTML, e.g. <code>Sheet1.html<\/code><br \/>\n4. Run this slightly ridiculous pipeline:<\/p>\n<pre><code>sed 's\/&lt;td\/\\n&lt;td\/g' Sheet1.html | # insert newlines before cells\r\ngrep '&lt;td' | # select just the lines with data cells\r\nsed 's\/&lt;td[^&gt;]*&gt;\/\/g' | # remove the data cell before the link\r\nsed 's\/[&lt;]a target=\"_blank\" href=\"\/\/g' | # remove code before links\r\nsed 's\/\".\\+\/\/' | # remove everything after the link\r\nsed 's\/&lt;.\\+\/\/' &gt;raw_links.csv # replace non-link rows with blanks<\/code><\/pre>\n<p>5. Upload <code>raw_links.csv<\/code> to Google Drive<br \/>\n6. Open <code>raw_links.csv<\/code> in Google Sheets<br \/>\n7. Copy column A, and paste into a new empty column in your target Google Sheet<br \/>\n8. Replace the &#8220;Link&#8221; column with a formula, e.g.<\/p>\n<pre><code>=IF(ISBLANK(L2), \"\", HYPERLINK(L2, \"Link\"))<\/code><\/pre>\n<p>9. Highlight the calculated &#8220;Link&#8221; column<br \/>\n10. Go to <em>Data &#8212; Protect Sheets and Ranges<\/em> to help prevent your colleagues from hiding useful URLs behind link text!<\/p>\n<p>Obviously some of those steps are optional, depending on your goal. I think it is useful to preserve the URLs in an easy-to-access format, but at the same time I understand that it does make the spreadsheet more cluttered.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I was working with a shared Google Sheet at work and ran into this: I get it, URLs can be long and messy. We want narrow columns that look clean, not cluttered. But I wanted to analyze the URLs and search for certain content and patterns, which were hidden from me behind the link text. &hellip; <a href=\"https:\/\/osric.com\/chris\/accidental-developer\/2024\/04\/extracting-links-from-google-sheets\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Extracting links from Google Sheets<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[232],"tags":[577,293],"class_list":["post-3782","post","type-post","status-publish","format-standard","hentry","category-tips-tricks","tag-google-sheets","tag-sed"],"_links":{"self":[{"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/posts\/3782","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/comments?post=3782"}],"version-history":[{"count":14,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/posts\/3782\/revisions"}],"predecessor-version":[{"id":3800,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/posts\/3782\/revisions\/3800"}],"wp:attachment":[{"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/media?parent=3782"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/categories?post=3782"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/tags?post=3782"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}