07-19-11 - The internet just doesn't work

Joe Duffy's page at bluebyte is down or gone (and has been for weeks).

Which reminds me that the internet just doesn't work.

I mean, as a way of stealing money from stupid people, it works great, which I suppose is what the people behind the modern internet are really interested in. But as a way of presenting information in a simple, efficient, permanent, archivable format, it's shit.

Whenever I go back to one of my posts from a few years ago, one that I carefully linked to good info, half the links don't work anymore.

It's just worse than fucking five-hundred-year-old technology (books). I can buy a book and put it on my shelf and it doesn't disappear in the night.

Of course it's even worse if you make fancy pages that use AJAX or Flash (I don't even know what the new widget flavor of the month is) or proprietary formats like PPT or whatever, since that stuff will be a huge pain to keep working 10-20 years from now.

Anyway, pursuant to this I thought I should go and actually download some of my favorite pages. Unfortunately it's much harder than you might think.

Anything at Google Groups is a good example of the problem.

It sure looks like just a bunch of plain text. Oh no. It's running through some kind of crazy Google mumbo-jumbo. If you just use Firefox's "Save Page" you get 600k of shit for that tiny bit of text, and it doesn't come remotely close to downloading it correctly (but at least it does get the primary text).

If you use HTTrack to try to mirror the whole page, it downloads about 1000k of shit and fails to get a readable page AT ALL.

OMG this should not be so difficult. Text, people, text!
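For pages where the text really is sitting in the HTML, pulling it out doesn't take much. Here's a minimal sketch using only Python's standard library; the names TextOnly and html_to_text are made up for illustration. It throws away the tags plus anything inside script and style blocks. It won't help with pages that build their content in script after loading, which may well be what's going on in the Google Groups case.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside script/style, with whitespace collapsed.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def html_to_text(html):
    """Return the text chunks of an HTML string, one chunk per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Feed it the contents of a saved .html file and write the result to a .txt next to it; that's a file that should still open in 20 years.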


Billy said...

Wait a few more years, when URL shorteners have had time to bake a little, then collapse.

cbloom said...

True, there are a lot of sites that are a tangle of references and dependencies on other sites and services; one of them goes away and it's fucked.

BTW I forgot: another major lockfree site that's gone is Thomasson's original appcore site. He's moved some of the stuff to a new site but lots of old links are dead.

Branimir Karadžić said...

Totally agree, but there is a solution. You have to get Google Reader, then you can use Note in Reader to copy the full content of the page into your Google Reader Notes for later. You can also back up your Reader items through the Atom feed that comes with shared items. You have to back up regularly, but it goes back approx 2 years. Note in Reader removes all the crap formatting and leaves it almost in plain text. I don't even surf the web anymore; everything comes to me through RSS/Atom into Google Reader. The best part is that it's all searchable (only not as fast as regular search).

See my Google Reader shared items here:

Btw, why are you not on G+?

jfb said...

You might consider setting up your site to link everything to archive.org with the date of the post. IIRC that's vaguely what I did when I created mud-dev.zer7.com.

But yeah, sites with scripting all over the place are screwed.
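A minimal sketch of the archive.org idea, assuming the Wayback Machine's usual URL layout (https://web.archive.org/web/ followed by a timestamp and then the original URL). The function name wayback_url is made up for illustration, and nothing here checks that a capture near that date actually exists.

```python
from datetime import date


def wayback_url(url, post_date):
    """Build a Wayback Machine link for `url` dated at `post_date`."""
    # A date-only YYYYMMDD stamp; the archive picks a capture near that date.
    stamp = post_date.strftime("%Y%m%d")
    return "https://web.archive.org/web/" + stamp + "/" + url


# Example: a link from a post dated 07-19-11 (example.com is a placeholder)
print(wayback_url("http://example.com/page.html", date(2011, 7, 19)))
```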

cbloom said...

Branimir, can you forward non-RSS pages into Reader?

I've seen web-to-rss routers, but they're way too klunky for daily use; is there a "put this in Reader" that just works on normal web pages?

Branimir Karadžić said...

If you're asking for RSS notifications from web pages that don't provide RSS or Atom, I'm using http://page2rss.com/ (Google Reader had this feature but they discontinued it for some unknown reason?!). But in my case the pages I find that don't have RSS notification are rare, and mostly are academic web pages with white papers. Page2Rss does a decent job, since it shows the diff between changes, but it's not as good as real RSS.

When RSS provides only short text or only the title, I use the Note in Reader feature to copy the full text into Reader. Even if an article is multi-page you can make it one page in Reader.

Darius Kazemi said...

http://pinboard.in/ is a web bookmarking service, but for an additional yearly fee they archive every page you bookmark. I believe you can download these archives.

