Extracting links from Google Sheets

I was working with a shared Google Sheet at work and ran into this:

An excerpt of a Google Sheet. Each row contains a cell with a hyperlink labeled Link
An excerpt of a Google Sheet. Each row contains a cell with a hyperlink labeled Link, but the actual URL is not displayed.

I get it, URLs can be long and messy. We want narrow columns that look clean, not cluttered. But I wanted to analyze the URLs and search for certain content and patterns, which were hidden from me behind the link text.

How can I extract all the URLs?

Brute force?
I could get each link one at a time, by selecting each cell and using the link copy function in Google Sheets. But that would be extremely tedious and boring.

Can paste-with-formatting actually help?
I tried copying-and-pasting the data into a variety of applications that I thought might be link-aware, but my results always ended up with just the link text and no URLs.

StackOverflow to the rescue?
The community provided plenty of custom formulae in How to extract URL from Link in Google Sheets using a formula?

But I wasn’t interested in learning about creating a custom Google Sheets formula. Maybe that is something I should learn more about, but I just wanted a quick fix.

Can AI tools help?
I admit, I logged into Microsoft Copilot to see if it could handle this. It confidently told me that it could, but then it failed every time with complaints about every file format I used to upload the data: Excel, HTML, PDF. (I was using test data, not actual work data, in case you are wondering how foolhardy or cautious I am about sharing data with LLM tools.)

A sed and grep pipeline solution

1. Copy just one target column to a new Google Sheet
2. Download the new Google Sheet as HTML
3. Open the resulting ZIP file and extract just the HTML, e.g. Sheet1.html
4. Run this slightly ridiculous pipeline:

sed 's/<td/\n<td/g' Sheet1.html | # insert newlines before cells
grep '<td' | # select just the lines with data cells
sed 's/<td[^>]*>//g' | # remove the data cell before the link
sed 's/[<]a target="_blank" href="//g' | # remove code before links
sed 's/".\+//' | # remove everything after the link
sed 's/<.\+//' >raw_links.csv # replace non-link rows with blanks

5. Upload raw_links.csv to Google Drive
6. Open raw_links.csv in Google Sheets
7. Copy column A, and paste into a new empty column in your target Google Sheet
8. Replace the “Link” column with a formula, e.g.

=IF(ISBLANK(L2), "", HYPERLINK(L2, "Link"))

9. Highlight the calculated “Link” column
10. Go to Data — Protect Sheets and Ranges to help prevent your colleagues from hiding useful URLs behind link text!

Obviously some of those steps are optional, depending on your goal. I think it is useful to preserve the URLs in an easy-to-access format, but at the same time I understand that it does make the spreadsheet more cluttered.

Leave a Reply

Your email address will not be published. Required fields are marked *