Join 3,374 readers in helping fund MetaFilter (Hide)


Have paragraphs. Want just the hyperlinks. In Excel.
May 19, 2010 8:10 AM   Subscribe

Need list of trade publications. Found huge one on Ulrich's web. How can I remove all information except for the (hyperlinked) titles, ideally in a way that keeps the titles on separate lines so I can then paste them into Excel?

A custom search on Ulrich's web gave me a list of almost 1,300 English language trade publications, circulation between 50,000 and 9,999,999 (their max). Each publication listing contains additional information (publisher, country, ISSN, start year, status, price). Though it looks sort of like a chart online, I can't highlight by column, and when I copy it into MS Word 2003, I get a sort of paragraph thing. Example below (the [ ] is a checkbox, and the first part (title) is a hyperlink):

[ ] AAII Journal
American Association of Individual Investors
United States
0192-3315
1979
Active
See Full Record
AAP News
[ ] American Academy of Pediatrics
United States
1073-0397
1993
Active
USD 138.0

There are also sometimes little icons between the checkbox and the the title. To go through 1297 publications and remove all the additional information seems a huge waste of time. I'm convinced there's a more efficient way, but I can't figure out what it is and Google hasn't helped.

I thought of pasting it into Excel directly and using macros to remove 6 out of every 7 lines, but it's probably a bad sign that even using "Clear Contents" on the entire sheet does not remove the checkboxes or icons.

Again, the titles are the only hyperlinked text, and I want to keep them on separate lines. The goal is to end with a list of all 1297 titles in an Excel column and nothing else.

Finally, I've tried making a list on Ulrich's web of the titles I want but that format doesn't look any more workable. Does anyone who is familiar with the site know a way to manipulate the data there in a way that would make this easier?
posted by randomname25 to Computers & Internet (10 answers total)
 
The checkboxes and icons are probably objects. In Excel, hit F5 (for Go To), hit the "Special..." button and select Objects. This will select all the objects on the sheet. Then hit Delete once.
posted by soelo at 8:22 AM on May 19, 2010


I tried your suggestion; it worked for the icons but not the checkboxes.
posted by randomname25 at 8:40 AM on May 19, 2010


Have you tried Alt-Select to select the text in the "titles" column only on the original html table? It could work depending on how the table is formatted.
posted by prenominal at 8:44 AM on May 19, 2010


Can you select the cells you need without selecting the checkboxes? If so, you could copy and paste the cells into a new workbook.
posted by soelo at 8:45 AM on May 19, 2010


Alt-Select didn't work. Weirdly, when I paste into Excel, some of the checkboxes are on the line preceding the title, and others are in the same cell.
posted by randomname25 at 8:52 AM on May 19, 2010


I did a custom search on Ulrich's, selected "all" titles, added to a "new list". Then clicked on the "lists" link, chose my list and clicked on "print". In the print options, I chose only the title and the URL. The output was a text file with the title and the URL, which you should be able to paste into Excel. Hope this works for you!
posted by prenominal at 9:03 AM on May 19, 2010


I forgot to mention that you can also choose "title" alone first on the list and print, paste it into Excel, then go back, choose "URL" only and paste it again. This would give you the title and URL side by side with the record #. Then delete extra columns, sort to group all non-blank rows together etc.
posted by prenominal at 9:08 AM on May 19, 2010


For removing all lines EXCEPT hyperlinks, you could have a macro roll through the whole page and delete non-linked lines.

The key code to identify lines that are linked is (this will popup if there's a link there):
If Selection.Hyperlinks.Count > 0 Then
    MsgBox ("linked!")
End If
So this code should just leave you with cells with links, and a ton of blank rows:
Sub RemoveAllButLinks()
'
'Assumes the data is in Column A. If it's not, change j to the number
'of the column with the data (A=1, B=2, etc)

Dim i As Integer
Dim j As Integer
Dim lastRow As Integer

j = 1 ' this is the column with the data
i = 1 ' set this to the first row with data
lastRow = ActiveSheet.Cells.SpecialCells(xlCellTypeLastCell).Row

For i = 1 To lastRow
    
    If Cells(i, j).Hyperlinks.Count = 0 Then
        Cells(i, 1).Value = ""
    End If
Next i

End Sub

posted by inigo2 at 9:13 AM on May 19, 2010


Oops -- within that IF statement, it should be "Cells(i, j)", not "Cells(i, 1)" (though if the data is in column A, then either one works).
posted by inigo2 at 9:14 AM on May 19, 2010


Thanks, people!
posted by randomname25 at 6:55 AM on May 25, 2010


« Older I'm honeymooning at a luxury r...   |  I need hors d'oeuvre-y ideas f... Newer »
This thread is closed to new comments.