The Internet

The Anti-Thesaurus: Unwords For Web Searches 148

Posted by timothy
from the these-are-not-the-words-you're-looking-for dept.
Nicholas Carroll writes: "In the continual struggle between search engine administrators, index spammers, and the chaos that underlies knowledge classification, we have endless tools for 'increasing relevance' of search returns, ranging from much ballyhooed and misunderstood 'meta keywords,' to complex algorithms that are still far from perfecting artificial intelligence. Proposal: there should be a metadata standard allowing webmasters to manually decrease the relevance of their pages for specific search terms and phrases."
This discussion has been archived. No new comments can be posted.

  • Sounds Good But... (Score:3, Insightful)

    by TMacPhail (519256) on Tuesday November 20, 2001 @03:20AM (#2588143)
    This sounds like a good plan but I don't think anyone would be willing to risk having their page show up lower in a search when someone was intending to find it. Plus anyone that finds the page in a search by accident is just a new potential customer.
    • by /Wegge (2960)
      Plus anyone that finds the page in a search by accident is just a new potential customer.

      On the other hand, any potential customer who finds the page as a result of a broader match than warranted by the page might also remember the site as one that doesn't have what he needs. I don't claim to understand mainstream consumerism, but in my professional capacity, I tend to avoid companies that try to make a follow-up sale on a completely unrelated issue.

      • > should be a metadata standard allowing webmasters
        > to manually decrease the relevance of their pages
        > for specific search terms and phrases."

        Last time I checked, the problem was stopping XXX BRITNEY NIPSLIP from turning up as the result to "+car +transmission +repair".
    • by Krimsen (26685) on Tuesday November 20, 2001 @03:39AM (#2588182) Homepage
      You are basing this on the fact that all people are consumers and all they are searching for are goods and services. What if I am searching the web for info on the DMCA and someone's webpage was called "DMCA" -short for "David, Michael, Cathy and Andrea" (or whatever) If they find that a lot of people are coming across the page accidentally, they can lower the relevance on the page on searches for "DMCA"...
      • Ok, so this hypothetical "David, Michael, Cathy and Andrea" site might get more hits than they wanted. If they were not intending to sell something then they probably don't actually care about the number of hits that they get. In all likelihood the site is a free site hosted by geocities or some other similar service. In this case it would be considerate of them to use DMCA as a nonword for the meta tag, but they would have no responsibility to actually place it as one. Sites that are trying to gain business through the web usually like the extra hits because they create potential customers, or at least someone who might happen to mention something they saw to another potential customer.
        • by Krimsen (26685)
          Agreed on all points. I guess this concept of nonwords really is kind of dependent on people putting some effort towards something that doesn't immediately benefit them. Eventually "What goes around, comes around" and if everyone uses the non-words, searches will become better. However, I'm not so sure that people are willing to put effort into something that they won't see a return from right away.
          • Wow..."Flamebait"?

            We've got some real winners modding around here as of late... *sigh*
            • Since this summer (or winter if you're below the equator) there has been a concerted effort to completely f the moderation system. Someone(s) or something(s) are specifically targeting good posts and modding them down. Individual posters have been targeted as well.

              The only thing to do about it is to metamoderate and make sure lame behavior like modding your parent post "Flamebait" gets marked "Unfair".
            • I don't have mod points right now, but has anyone else noticed what happens if you use a wheel mouse under Windows? You do your mod, and then you "wheel down" to click the moderation button. If you don't remember to click away from the mod box first, you end up giving the poor person a completely different mod than you intended.

              Maybe this is only an Opera issue?
        • If David, Michael, Cathy and Andrea were paying per megabyte for the bandwidth used by their site (for instance if they required what some ISPs consider to be premium services such as ASP or PHP) they would not want everyone who was looking for DMCA information to view their site, since that would most likely more than double their bandwidth consumption. With a frequently searched-for word such as DMCA being used as a nonword for their site, they are saving both their own money and the performance of their ISP's network and servers. Another example would be if someone's surname is the same as that of a commercial organisation. They do not want all of that organisation's customers wandering into their site by accident.
    • I don't think anyone would be willing to risk having their page show up lower in a search

      Oh you capitalist-thinkers. Spare a thought for Geocities/ Hypermart users who have to start shelling out money if they cross a certain hit threshold.

  • How about this? (Score:4, Insightful)

    by NitsujTPU (19263) on Tuesday November 20, 2001 @03:27AM (#2588158)
    Just shitlist any site that is obviously reaching for hits? If a porn site has the words "Alan Turing" in its metadata and doesn't mention anything about Turing later in the site, list them as not being allowed to participate in your search.

    Hell, an engine that did that would almost be useful.
    • from webmonkey on search engine foolin' software:

      You can guess why: Search engine developers buy copies of the same software, learn how to recognize its output, and then demote your site or block it altogether when they spot that pattern in your pages.


      no hard "this site was banned" but it seems there are some who do demote/block if they catch you putting garbage in your keyword list.

      PS if any porn site puts 'alan turing' in their keywords I would actually want to go there - shows some imagination to say the least, gotta give them props for that...
    • Re:How about this? (Score:4, Informative)

      by 21mhz (443080) on Tuesday November 20, 2001 @05:48AM (#2588347) Journal
      This is where Google's PageRank(tm) system chimes in: an Alan Turing biography linked by half a hundred sites, each with its own decent rating, will undoubtedly be rated higher than a porn site that just listed "alan turing britney spears anthrax riaa cowboyneal" in its meta keywords and is linked by only a handful of the millions of sites like it. Use the great cross-linking fabric of the Web, Luke.

      Disclaimer: I'm in no way associated with Google.
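
      To make the link-based ranking idea above concrete, here is a minimal textbook-style sketch of PageRank by power iteration. It is not Google's actual implementation; the toy link graph, the page names, and the damping value of 0.85 are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by power iteration over a dict of page -> outbound links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank everywhere.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three sites link to the biography,
# while two spam pages only link to each other.
links = {
    "uni-a": ["turing-bio"],
    "uni-b": ["turing-bio"],
    "museum": ["turing-bio", "uni-a"],
    "turing-bio": ["museum"],
    "spam-1": ["spam-2"],
    "spam-2": ["spam-1"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

      In this toy graph the biography ends up ranked above either spam page, because its inbound links come from pages that are themselves linked to. Meta keywords play no part in the calculation.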
  • by Satai (111172) on Tuesday November 20, 2001 @03:28AM (#2588161)
    I can see it now. To Do lists are being written up as we speak...

    1. Increase relevance for Penis Enlargement.
    2. Decrease relevance for Bullshit.


  • by Overcoat (522810) on Tuesday November 20, 2001 @03:31AM (#2588166)
    Is the phenomenon of people naming their website something that has nothing to do with the content of the website so widespread that it necessitates a new metadata tag and the consequent alteration of search engines to recognize it?

    Google seems to do a good enough job of filtering out irrelevant responses as it is.

  • Proposal: there should be a metadata standard allowing webmasters to manually decrease the relevance of their pages for specific search terms and phrases.

    Okay, pretend I'm a webmaster. What's my incentive to have my page show up LESS in anyone's search results?!

    If someone didn't want my site, why do I care if they get it? And if someone wants my site, I don't want to take any chance with an "anti-thesaurus" that might end up excluding my site!
    • by Nate Eldredge (133418) on Tuesday November 20, 2001 @04:29AM (#2588246)
      I work as a sysadmin for a computer science department. Until recently, the system staff would frequently get messages along the lines of

      From: frankie3327@aol.com
      To: staff@cs.here.edu
      Subject: help!

      i have a lexmark 4590 and it wont print in color.
      it only makes streaks. also the paper always
      jams. how do i fix it? please reply soon!

      The senders never had any connection to the college or the department. We'd reply telling them we had no idea what they were talking about, and that they should seek help elsewhere. It was rather annoying.

      We eventually figured it out. The department web site maintains a collection of help documents for users of the systems. One of them talked about how to use the department's printers, what to do if you have trouble, etc. At the bottom it listed staff@cs.here.edu as the contact address for the site.

      You've probably guessed it by now. That page came up as one of the top few hits when you searched for "printing" on one of the major search engines (I forget which one). Apparently lusers would find this page, notice that it didn't answer their question, but latch on to the staff email address at the bottom, as if we were an organization dedicated to helping people worldwide with their printers. Furrfu!

      I think we reworded the page to emphasize that it only applied to the college, and we haven't received any more emails lately. But if we could have kept search engines from returning it, that would have been even better. Since in our case the page was intended for internal use, we don't care whether anyone can find it from the Internet. Our real users know where to look for it.

      So in answer to your question: When a search engine returns a page that doesn't answer the user's question, the user will often complain to the webmaster. That's a clear incentive to the webmaster not to have the page show up where it's not relevant. Also, it's not the goal of every site simply to be read by millions of people; some would rather concentrate on those to whom it's useful.

      • by Ex Machina (10710) <jonathan.williams@noSpAM.gmail.com> on Tuesday November 20, 2001 @05:17AM (#2588312) Homepage
        But if we could have kept search engines from returning it, that would have been even better. Since in our case the page was intended for internal use, we don't care whether anyone can find it from the Internet. Our real users know where to look for it.

        http://www.robotstxt.org/wc/exclusion.html [robotstxt.org]
      • robots.txt ? (Score:3, Informative)

        by Atrax (249401)
        did you have the page disallowed for search engines? if something is for internal use only, you really ought to have dropped in a robots.txt to exclude it altogether.

        if more people used robots.txt, a lot of 'only useful to internal users' sites would drop right off the engines, leaving relevant results for the rest of the world...

        just a thought......
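
        As a rough illustration of the robots.txt suggestion, the sketch below uses Python's standard `urllib.robotparser` to check how a well-behaved crawler would read an exclusion rule. The `/internal/` path, the host name, and the `ExampleBot` user agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that hides an internal help section from all crawlers.
robots_txt = """\
User-agent: *
Disallow: /internal/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler asks before fetching each URL.
allowed_internal = parser.can_fetch(
    "ExampleBot", "http://cs.example.edu/internal/printing.html"
)
allowed_index = parser.can_fetch("ExampleBot", "http://cs.example.edu/index.html")
print(allowed_internal, allowed_index)
```

        Note that this only works for crawlers that choose to honor robots.txt, and it removes the page from results entirely rather than for particular search terms.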
        • Re:robots.txt ? (Score:1, Interesting)

          That's just it, though. You say to use robots.txt to have it excluded from search engines, but that would exclude it altogether. With this new meta tag, it would only have kept search engines from returning the page for, say, a search on "printers", while still returning it for "tech support", if that indeed is what the page was intended for. I think this is a great idea.
          • missing the point. the post talked about *internal* pages - this isn't a page that should really even be looking for a search engine listing, really, apart from perhaps some altruistic urge. an outside user should end up at the real support site, not a university CS dept., which is *just* for the university. it's a pretty big issue, and i was only pointing at a small part of it...

            j
        • The only problem is that the site WAS helpful to people other than internal users. Third-party troubleshooting information is often a useful resource, particularly for older hardware. Publishing information like this where any ol' luser can find it is about the same as releasing Free software; it doesn't make much more work for you, but it helps the community.

          And, if the admin had a clue, a simple "WTF did you get my addy?" emailed to joe6paq@aol.com would probably have explained everything.
    • What's my incentive to have my page show up LESS in anyone's search results?!


      Saving bandwidth, perhaps? For a hobbyist's website hosted cheaply (and thus having a low transfer limit), it might be quite desirable not to attract too many visitors who aren't actually interested in the site's contents. Of course, that's not a very common scenario; good search engines will give such sites a low priority anyway because they're not linked to very often.

    • by Anonymous Coward
      It's even worse than a lack of incentive to decrease relevance. There's actually a strong incentive not to: advertising.

      CPM ads pay the same regardless of relevance. CPC ads tend to pay *even more* for visitors who aren't interested in your content, since they're more likely to click on the ad on the way out.
  • by Dr. Awktagon (233360) on Tuesday November 20, 2001 @03:35AM (#2588173) Homepage
    Well it's not as good/effective an idea as what this fellow is suggesting, but you can have a lot of fun with people based on their Referer fields. for instance, use it to just bounce them back to their queries, or bounce them to a different query (one for porn sites is always fun), or bounce them to a more relevant page, or fuck with them however you like. If you've ever had to set up Apache to block people from linking your images, you already know how to do it.
    • If you've ever had to set up Apache to block people from linking your images, you already know how to do it.

      Can you point me to a good howto?
      • There's a pretty good "howto" thing here [apache-server.com] that should get you started.
      • Well some docs are here [apache.org], and the mod_rewrite reference is here [apache.org].

        Here is a goofy example that does a redirect back to their google query, except with the word "porn" appended to it. As an added bonus, it only does it when the clock's seconds are an even number. (Or do the same test to the last digit of their IP address). Replace the plus sign before "porn" with about 100 plus signs and they won't see the addition because each plus sign becomes a space. The "%1" refers to their original query.

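        # Send some Google visitors back to their own query with "porn" appended
        RewriteEngine On
        # Only act when the clock's seconds value ends in an even digit
        RewriteCond %{TIME_SEC} [02468]$
        # Only act on visitors arriving from a Google results page
        RewriteCond %{HTTP_REFERER} google\.com/search [NC]
        # Capture the value of the original q= parameter as %1
        RewriteCond %{HTTP_REFERER} [?&]q=([^&]+)
        RewriteRule . http://www.google.com/search?q=%1+porn [R=temp,L]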

        Here's another one that checks the user-agent for an URL, and then redirects to it. This keeps most spiders and stuff off your pages since they usually put their URLs in the User-Agent:

        RewriteEngine On
        # Capture a URL embedded in the User-Agent header as %1
        RewriteCond %{HTTP_USER_AGENT} "(http://[^ )]+)"
        # Redirect the client to that captured URL
        RewriteRule . %1 [R=permanent,L]

        Anything you can think of is possible. I think you can even hook it into external scripts.

        • This keeps most spiders and stuff off your pages since they usually put their URLs in the User-Agent:

          Why not just use robots.txt? Either way you're relying on the spider operator to write their bots in a particular way.

  • A bit negative? (Score:2, Interesting)

    by ukryule (186826)
    Wouldn't it be better to put more effort into describing what a site IS about, rather than what it ISN'T?

    After all, if you describe your site, a good search engine will use this information well (so you shouldn't get too many erroneous hits). However, if you list your non-words, a bad search engine will just see this list and treat them as keywords!
  • When I first read this, it seemed like a good idea. However, it quickly dawned on me that this is a solution in search of a problem. How many people are actually complaining about too many hits to their web site?

    Please forgive me for mentioning capitalism on Slashdot, but a website that receives many misdirected hits is perfect for targeted marketing. Think of the possibilities: if your web site is getting mistaken hits for "victor mousetraps," sell banner ads for "Revenge" brand traps and make a killing on the click-throughs. With a little clever Perl scripting, determine which banner ad to show based on which set of "wrong keywords" show up in the referer. Companies will pay a lot of money for accurately targeted advertisements. Selling these ads would undoubtedly pay the whole bandwidth bill and probably make a profit to boot.

    So no, unwords are not necessary. Unless you're running a website off a freebie .edu connection and aren't allowed to make a profit off of it. Otherwise you're just throwing money away.

    ~wally
    • While it may take a leap of logic to want to do this for external search engines, I ran into this problem when building the search engine for our e-commerce site.

      At first we just allowed our out-of-the-box search engine package to index our catalog, but the problem we kept running into was the relevance of the results (for example returning VCR stands ahead of an actual VCR when the search was "VCR").

      So to solve this our merchandizers manually added keywords to each group of products that amounted to a thesaurus. We coded the indexing to place a weighted value for these keywords ahead of the title words, and those ahead of body text.
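
      The weighting described above might look something like the following sketch. The field names, the weight values, and the sample catalog entries are assumptions for illustration, not the poster's actual code; the only thing taken from the post is the ordering of curated keywords ahead of title words ahead of body text.

```python
import re

# Assumed weights: curated keywords count most, then title words, then body text.
FIELD_WEIGHTS = {"keywords": 5.0, "title": 3.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(product, query):
    """Sum weighted counts of each query term across the product's fields."""
    terms = tokenize(query)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = tokenize(product.get(field, ""))
        total += weight * sum(tokens.count(term) for term in terms)
    return total


# Hypothetical catalog entries.
catalog = [
    {"title": "Oak VCR Stand", "keywords": "furniture stand",
     "body": "Holds a VCR and a TV."},
    {"title": "4-Head VCR", "keywords": "vcr video recorder",
     "body": "Records and plays tapes."},
]
ranked = sorted(catalog, key=lambda p: score(p, "VCR"), reverse=True)
print([p["title"] for p in ranked])
```

      With these made-up weights the actual VCR outranks the stand, because the stand only matches in its title and body while the VCR also matches in its hand-assigned keywords.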

      It's actually a bigger problem than most geeks realize (as our CEO pointed out.) We were trying to return not just pages that corresponded to the search string, but to the intent of the user. That takes a little more thought on the part of the search engine coders and the implementers.
    • How many people are complaining about too many hits?

      Well, speaking personally, I don't want people arriving at my web site unless they're actually looking for the content that's on it. That's because I pay for bandwidth.

      I also know plenty of people who have web sites for their friends, but have ended up being pestered by online perverts after they ended up in search engine listings.
      • How many people are complaining about too many hits?

        Me, definitely.

        I have a section of my site related to Steve Albini's bands, including Big Black [petdance.com]. I get tons of hits looking for things like:

        • big black boys fucking white girls
        • big black tits
        • big black nigger dick
        • big black dick fuck white pussy
        • fuck me with that big black dick
        • big black women who shit
        • big black nude guys
        • shake that big black ass
        • beautiful big black booty
        • big black asses in a skirt
        • big black asses in London
        • big black booty in leather pants
        • big black rumps
        • first big black cock in her pussy
        • kiss my big black booty

          and my favorite...
        • black men with big black fuck sticks
        Maybe the short answer is that we need a <META KEYWORD="non-porn"> tag.
    • Yes, this isn't something the typical dishonest commercial web site would ever do (the marketing dept. would have a fit), but for an information site (the type that provides real content) it would be great. And it would save time for people who were searching for information, not products.
      • There's nothing dishonest about targeted advertising. Why do you think you get coupons in the mail for Wonder bread after you've bought a loaf of Butternut with your supermarket discount card? (Although the practice can sometimes raise privacy concerns, it doesn't in the "victor mousetrap" case.)

        Why would anyone want to pay for their bandwidth if they could easily get commercial sponsors to pay for it?

        ~wally
  • by ahoehn (301327) <andrew@@@hoe...hn> on Tuesday November 20, 2001 @03:37AM (#2588179) Homepage
    Not such a bright idea to whine about too much traffic on your website and then get a link to your site from a slashdot article.
  • If I think that this is just a retarded stupid idea.

    The people whose web pages are being thrust to the top of the query lists are the people who are polluting the metadata and other tags for the sole purpose of getting their sites higher in the search lists.

    So lemmy get this straight: you want all good and honest people (who aren't causing the problem in the first place) to opt-out of common searches (which they'd never want to do), and this will thus remove the legitimate entries from the pool of queries, returning an even more polluted list from your search engine.

    am I missing something here?

    Although there are a few people who would be helped by removing absolutely irrelevant queries, the vast majority would actually suffer if they used this.
    • No, he wants them to opt out of searches that they know have no relevance to the content, and where they know that they users who get there will just get annoyed and go somewhere else anyway. For people trying to make money on the web, this is a way to reduce bandwidth costs, and to be able to better target people actually interested in what they provide (and thus more willing to pay or click on ads).
  • when it realizes that all the TERRORISTS have to do is put the following bit in their HTML: to conceal their web-based activities....
  • Better Metadata (Score:4, Interesting)

    by nyjx (523123) on Tuesday November 20, 2001 @03:48AM (#2588191) Homepage
    While the idea would probably do some good if widely adopted, what's really needed is to reduce the need for text-based indexing of web sites by increasing the amount of explicit semantic information about their content.

    Marking up pages with information about the meaning of the terms on them is the main thrust of the work on semantic web - see http://www.daml.org/ [daml.org] (for DAML - the DARPA Agent Markup Language), http://www.semanticweb.org/ [semanticweb.org] (One of the main information sources) and finally the new W3C activity on the subject: http://www.w3.org/2001/sw/ [w3.org].

    How far, how fast it will go is another matter but there's certainly a lot of interest in creating a more "machine readable" web.

    • It seems to be a chicken-and-egg situation at the moment -- I'm doing quite a lot of work producing Dublin Core [dublincore.org] metadata in XHTML and RDF format for a content management system, however no search engines yet support the indexing or searching of this metadata.

      When they do then a proposal like this might make (some) sense.

      • I think the semantic web effort has the same problem - no incentive to mark up if there are no search engines / agents to read the stuff. No incentive to build the agents if it isn't out there.

    • The problem is not so much to understand the content of a page. That can be done in many instances. It is not that hard to understand if a page is talking about a river "bank" or a money "bank". Usually there are enough quotes and links within the page to allow for this automated differentiation.

      The real problem is at the other side, when the user fires Google and enters the standard 2-4 query terms "bank australia". There is a lot less information there for a computer to decide that the user is looking for a bank in Australia.

      Metadata on the web pages is pretty much useless for understanding what the user wanted.
  • search issues (Score:2, Interesting)

    by jahjeremy (323931)
    The problem stems from the basic lack of data tagging standardization on the internet. HTML is formative rather than indicative of the types of data that are present. While META keywords are useful, validation is a problem using this method, given the huge number of pages and the propensity of some webmasters to fill this section with irrelevant garbage.

    The main power technique, at least on google, is utilizing quotes and AND/OR to limit search results. Rather than spewing a line of text, enclosing specific "phrases" often gives more accurate results.

    Then again, I have been able to simply cut n' paste error messages into the groups.google.com form and immediately receive accurate, useful hits. I think that though the internet and webpages are generally disorganized and uncentralized, an outside entity can impose order given enough bandwidth, time, energy and intelligence. In the future, web services, probably based on CORBA and SOAP, will allow sites to return messages to searchers or indexing services, thus doing away with a lot of the mystery in the current system.

    All that said, I have had excellent luck with google finding about 95% of all the information I have searched for in the past couple months, showing that a well-written spider and intelligent classification and rating can circumvent the problem of so much untagged, nebulous information.

    The internet is something like the world's largest library where anyone can insert a book and random organizers may (if they wish!) go through and make lists, hashes and indexes of the information for their own card catalogs. Right now, each search service maintains its own separate list! The crawler is like a super-fast librarian who can peruse the book. The coming paradigm will be fewer, more accurate and useful catalogs along with books that "insert themselves" into these schemes intelligently and discretely after a validation of informational content.

    • Google does well because it pays attention to the text *inside the hyperlink to pages*. For example, this link for news for nerds, stuff that matters [slashdot.org] means that google searches on news, nerds, stuff, and matters, are more likely to show /.

      Once you've thrown out the 'click here' and 'this link' junk, this is far more reliable than using meta tags, and often more reliable than looking for keywords within the page itself.
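
      A small sketch of the anchor-text idea, using Python's standard `html.parser`: collect the text inside each hyperlink, drop the junk phrases, and file what is left under the target URL. The junk list and the sample page are assumptions for illustration; a real engine would do far more than this.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumed list of uninformative link texts to throw away.
JUNK_ANCHORS = {"click here", "this link", "here"}


class AnchorCollector(HTMLParser):
    """Collect the visible text of each <a href=...> link, keyed by target URL."""

    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize whitespace and case before filing the anchor text.
            text = " ".join("".join(self._text).split()).lower()
            if text and text not in JUNK_ANCHORS:
                self.anchors[self._href].append(text)
            self._href = None


# Hypothetical page that links to the same site twice.
page = (
    '<p>Read <a href="http://slashdot.org/">news for nerds, stuff that matters</a> '
    'or <a href="http://slashdot.org/">click here</a>.</p>'
)
collector = AnchorCollector()
collector.feed(page)
collector.close()
print(dict(collector.anchors))
```

      The descriptive link text is kept as evidence about the target page, while the "click here" link contributes nothing.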

    • "While META keywords are useful, validation is a problem using this method, given the huge number of pages and the propensity of some webmasters to fill this section with irrelavent garbage."

      Search engines should reduce the relevance of pages with huge META sections.
  • My friend found that one of the highest things people were finding his webcomic by was "Digimon Porn"... And his comic has no "digimon" or "porn" about it...
    • by pen (7191)
      my site [geekissues.org] gets at least 50 hits a month from searches for "swedish porn". also, "amputated penis", "charcoal underwear", and "president bush daughter".
    • Ah, the joys of analog [analog.cx]. I regularly look through my log files for interesting stuff. Stuff people have been looking for and finding my web site [stacken.kth.se] (not as perverted as indicated) include:
      • "Long fingernail and long toenail fetish"
      • "mime nude photos"
      • "16 year old boys whith arm pit hair"
      • "easy and fast directions to make crack cocaine in the microwave"
      • "but she was my student why did i have impure thoughs"
      • "nude cartoons inspector gadget"
      • "secrets on how to suntan through your computer"
  • With all the terabytes a day coming into the Wayback Machine (http://web.archive.org), plus the tons and tons of stuff they have from ancient times (as far back as 1996!) it would be awesome if it were searchable. Even some kind of mundane type of search. Sure, Google's index is great, but this blows Google way out of the water. I've found sites in there I made in middle school and never wanted to see again, but data is data.
  • by pen (7191)
    If I'm searching for something and the wrong sites come up, I simply look for a keyword that is present on most of the sites I don't need that wouldn't be present on the sites I do need, and then add it to the exclusion list.

    For example, if I'm looking for info on a Toyota Supra and too many Celica-related pages come up, I'll type:

    toyota supra -celica
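
    For anyone curious how such a query could be evaluated, here is a minimal sketch of matching with exclusion terms. It is a toy whole-word matcher written for this example, not how any particular search engine works.

```python
def parse_query(query):
    """Split a query into required terms and terms prefixed with '-'."""
    required, excluded = [], []
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            required.append(token)
    return required, excluded


def matches(text, query):
    """True if the text has every required word and none of the excluded ones."""
    words = set(text.lower().split())
    required, excluded = parse_query(query)
    has_required = all(word in words for word in required)
    has_excluded = any(word in words for word in excluded)
    return has_required and not has_excluded


print(matches("Toyota Supra turbo specs", "toyota supra -celica"))
print(matches("Toyota Celica and Supra parts", "toyota supra -celica"))
```

    The first page matches; the second is dropped because it contains the excluded word.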

    On a related note, does anyone feel that Google's built-in exclusion list of universal keywords (a,1,of) is really aggravating when Google excludes those words in phrases?

    • That is completely different.

      The suggestion was intended to tell the search engines what words on your site aren't relevant for search purposes. So a site primarily about Toyota Celicas that mentions the Supra in a couple of places might want to add Supra to its "nonwords" entry, to avoid confusing people looking for info about Supras.

      So if the suggestion were in use by most people, you might not have to add "-celica" to your search, as it would be easier for the search engine to exclude pages that contain the word "Supra" but aren't relevant to your search.

      It's in no way a perfect idea. But if enough people use it it may have some value.
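
      A sketch of how a search engine might honor the proposal, assuming a hypothetical `<meta name="nonwords" content="...">` tag (no such standard exists) and a made-up penalty factor. It parses the tag with Python's standard `html.parser` and demotes the page when a query term appears in its nonwords list.

```python
from html.parser import HTMLParser


class NonwordsParser(HTMLParser):
    """Collect terms from a hypothetical <meta name="nonwords"> tag."""

    def __init__(self):
        super().__init__()
        self.nonwords = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "nonwords":
            self.nonwords.update((attrs.get("content") or "").lower().split())


def adjusted_score(html, query, base_score, penalty=0.1):
    """Scale down base_score if any query term is declared a nonword."""
    parser = NonwordsParser()
    parser.feed(html)
    terms = query.lower().split()
    if any(term in parser.nonwords for term in terms):
        return base_score * penalty
    return base_score


# Hypothetical Celica page that mentions the Supra in passing.
page = (
    '<html><head><meta name="nonwords" content="supra"></head>'
    "<body>All about the Toyota Celica. See also: Supra.</body></html>"
)
print(adjusted_score(page, "toyota supra", 1.0))
print(adjusted_score(page, "toyota celica", 1.0))
```

      The page keeps its full score for a Celica search and is demoted, not removed, for a Supra search, which is the difference from a robots.txt exclusion.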

  • by Rosco P. Coltrane (209368) on Tuesday November 20, 2001 @04:16AM (#2588226)
    If you replace <meta name="keywords" content="mickey mouse"> by <meta name="nonwords" content="bestiality mouse-fucking zoophilia kinky ....">, you might draw more Disney lovers and fewer perverts to your site, but I suspect your HTML file will grow quite a lot bigger ...
    • If you replace <meta name="keywords" content="mickey mouse"> by <meta name="nonwords" content="bestiality mouse-fucking zoophilia kinky ....">, you might draw more Disney lovers and fewer perverts to your site

      Mommy, what does "view source" mean, and why is the computer swearing at me?

    • Yes, unless the same Disney lovers use filtering software, which probably won't be incredibly impressed by the number of banned words in your HTML...
  • I can understand the author of the proposal, but I'm afraid that his proposal won't help the usual web searcher.

    So I would suggest that he think about checking the referrer, as this site [internet.com] shows, and maybe direct all users who come from a search engine to a page where he offers a search engine that is limited to his site. Since the referrer also includes the whole search string, he could maybe even use it to fill out his search form.
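
    Pulling the search string out of the referrer could be sketched like this with Python's standard `urllib.parse`. The list of search hosts and the `q` parameter name are assumptions that fit Google-style URLs; other engines may use different hosts and parameter names.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed set of search engine hosts whose queries we want to recognize.
SEARCH_HOSTS = {"www.google.com", "google.com"}


def search_terms_from_referer(referer):
    """Return the visitor's search string, or None if not from a known engine."""
    parts = urlsplit(referer or "")
    if parts.hostname not in SEARCH_HOSTS:
        return None
    values = parse_qs(parts.query).get("q")
    return values[0] if values else None


# parse_qs decodes '+' as a space, so the result is ready for a form field.
print(search_terms_from_referer(
    "http://www.google.com/search?q=victor+mousetraps&hl=en"
))
print(search_terms_from_referer("http://example.org/some/page"))
```

    The returned string could then be used to prefill a site-local search form. Keep in mind that the Referer header is supplied by the browser and can be missing or forged.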

    I would even prefer this method because it often happens to me that I enter a site via a link from a search engine and then find out that the result page is just part of a frameset and it's missing properties like Javascript variables. If I redirected search engine users to a defined starting point on my site they would have less trouble. (Don't start a discussion about the sense and use of frames here :-) )

  • ... you could just get people to switch to Google [google.com] instead.

  • On my idea notepad I said this:

    "Technique to negate words in a document for increased searching. For instance, include files that cause a phrase like 'How we converted to XHTML 1.0' to show up on every page. Only the page with actual information, should show up in search, not every page with the include file."
  • I use filenames all the time on google to find what I want. Sometimes I get lucky and find the file in a directory, with many other files related to the files I am looking for. Another added bonus is I don't have to wade through annoying banner ads or popup windows.
  • Given a particular word on a particular website, it's fairly easy to decide if it's relevant or not. How? By looking for links to that website from other websites which mention the same word. That's the idea behind Teoma [teoma.com] and a number of other search algorithms. Sites which "unintentionally" get hits for unrelated topics simply don't register on these engines. Link analysis provides much more accurate metadata, because it's based on other people's opinions.

    Another problem with metadata in general, of which spam is but one symptom, is the fact that creators of content often have no idea of how their content appeals, or fails to appeal, to other people. Did Mahir have any idea that his name would become a top-ranked search term? Does anyone have any idea how his content should be ranked for a given search term (besides number one, of course)?

    What is the number one piece of metadata found in spam messages? This is not spam.
  • Domain names (Score:1, Offtopic)

    by Breace (33955)
    On a related subject, I've been looking for a domain name that is a) easy to remember and b) does not generate a zillion hits if you type the name in a search engine. (and c) is not a silly long string of words).

    It's funny how most people think that common-word domains are valuable, but forget that a name that jumps out as the only result when typed into a search engine is pretty valuable too. Especially if it sounds like it is spelled.

    Maybe not the best example, but since the 4-letter domain names are practically all gone, I was going to register duxo.com. Unfortunately one of the many domain hogs got it the day I was going for it. :o(

    I got another one though, but it's not up yet so I won't tell what it is! ;o)))
    • You know, I've noticed this kind of behaviour before. It's too coincidental to attribute to chance. I suspect that there are people who monitor what domain names people are querying and then register them in the hopes of reselling them. Does anyone know about this?

      t.

      • IANAL and I don't have specific knowledge of this occurring, but really, what's to stop it from happening?

        My suggestion to anyone is that they develop three good domain names that they would be happy with. But for god's sake, do it *offline*! Don't search for them, don't try them in your browser, and don't tell anyone what they are. *Then* just go register one or all of them. Don't wait, don't search, and don't even breathe until they're yours.

        Oh, and don't forget to trademark the language in those URLs (can't be plain English remember). If someone sees your new URL and likes it, they could register the TM if you don't. Then they can sue you for ownership of the domain, since you're clearly infringing on their TM; and they'll probably get the domain in the end.

        Hey, I don't make the rules...

        And my favorite word today is don't.
  • More hits is almost NEVER a bad thing for a site's main purpose (getting people to see it, and hopefully take an interest in what's there)

    For just the same reason as the automotive industry has made clean fuel vehicles standard, and the very way our capitalist world operates. For the time (money) it takes to implement this thing to make the world a better place, the costs can not be substantiated. Granted, if a lot of sites did this, there would be more time for everyone to spend playing with their dog rather than dig through irrelevant search results. But Joe webmaster's company is never going to pay him to do it, and he's not going to spend his free time doing it when he could be spending time with his dog.

    That's the way the world is working right now, and people who want to change the world to a better place will probably spend their time doing other things rather than putting unwords in their web documents.
  • by vectus (193351) on Tuesday November 20, 2001 @05:29AM (#2588326)
    Webmasters, however, should be careful with these new "anti-words", as when they mix with their word counterpart, a gigantic explosion results.
  • by dun0s (213915)
    Porn sites who promote (through a variety of means) the words "free, porn, sex" and the like and then demote "pay, fee, membership, credit card".

    This proposal will not make the indexing of sites more reliable. If anything it will add to the common confusion associated with meta keywords. Yes, it is quite a nice idea in theory, but I can't see anyone wanting to exclude words from being searched. The main point in the proposal was that the author felt guilty about pulling in people who had entered search terms that appeared on his page. One would ask why he is publishing information on the internet if he doesn't want people to look at it. A better solution would be to get people to use search engines properly. As an example I will use the stalking on the internet term. If people put these words into Google and come up with his page, then perhaps they should have modified their query to something like "stalking on the internet" and they may not have found his page. On the other hand, if his page contains the phrase "stalking on the internet" it might be just what the seeker was looking for.

    To this proposal I say nay. Or perhaps oink.
  • The Semantic Web (Score:5, Interesting)

    by mike_sucks (55259) on Tuesday November 20, 2001 @05:46AM (#2588343) Homepage
    Surely this kind of issue is what Tim Berners-Lee and the W3C is trying to address with the Semantic Web. [w3.org]

    The problem with content on the web today is that while it is perfectly readable by humans, it is incomprehensible to machines. If Tim and Co get their way, and I for one would love to see the Semantic Web catch on, then we can get rid of kluges like the Anti-Thesaurus, HTML meta keywords and the like.
    • Whether you put them in meta elements (keyword, antithesaurus) or in the body of the document, strings by themselves have no meaning, no connection to the concept which they represent.

      Take for example a search for the string tar, which will yield documents containing:
      tar -zxf update.tgz, or cp update.tar update.old, or roofing tar , or jeg tar en øl nu

      Each instance of tar above has a different meaning, but the same spelling. When you get into misspellings, spelling variations, and conjugation, then the actual concept is even harder to associate with a given range of strings.

      Even Google searches are for strings and not concepts, but Google's ranking algorithm [google.com] relies on which pages get the most links from pages that also get the most links. However, you'll still get different results for color vs. colour and tyre vs tire. Because the algorithm only reflects how people have chosen their links, it does, from time [theregister.co.uk] to time [theregister.co.uk] give unusual associations. ;)
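
      For anyone wondering what "links from pages that also get the most links" looks like in practice, here is a minimal sketch of the simplified textbook form of PageRank (power iteration with a damping factor). It is an illustration only, not Google's actual implementation; the function name, the tiny link graph in the usage note and the handling of pages with no outgoing links are my own assumptions, and 0.85 is just the commonly quoted damping value.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    if n == 0:
        return {}
    # Start with every page equally important.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank
```

      With a toy graph such as `pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})`, page c ranks highest because two pages link to it, and a outranks b because its single inbound link comes from the well-linked c. Note that nothing in this calculation looks at what the words on the pages mean, which is the point made above.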

    • by Alomex (148003)
      Surely this kind of issue is what Tim Berners-Lee and the W3C is trying to address with the Semantic Web.

      Indeed, but how close are they to achieving anything of significance? AI has been working on a universal ontology for ages and gotten nowhere.

      The fact that Berners-Lee agrees that it would be a "cool thing to have" does not make it any more likely to happen (by the way, TB-L first proposed the semantic web almost five years ago).

    • The problem with the Semantic Web is that humans, in general, write web pages to be readable by humans, not by machines.

      This is not likely to change anytime soon.
    • Re:The Semantic Web (Score:2, Interesting)

      by Zspdude (531908)
      What you're suggesting, is that rather than trying to make machines as linguistically competent as we are, we should instead adjust to fit their convenience. (I'd never have thought I'd see the day that we began to negotiate compromises with machines, but that's offtopic). The problem is that, besides being very useful and efficient, it would restrict the versatility of our communication, and make surfing a lot less fun. No longer would we ever find great web sites by accident. Where would we be without our great and ambiguous language, which allows us to say: Time flies like an arrow. And yet does not exclude: Fruit flies like a banana. Go figure.
      • "What you're suggesting, is that rather than trying to make machines as linguistically competent as we are, we should instead adjust to fit their convenience."

        No, not at all. It's easy to retrofit a web site with RDF metadata about the content of that site, and it requires no human-visible changes to the site. Metadata can be stored in HTML meta tags or preferably in separate RDF description files. None of this affects the way people surf the Web, and unless they have a good browser they won't even know the additional metadata exists.

        In addition, using SW-friendly content in web pages (like strict XHTML, using CSS for all style, use of other XML dialects like SVG, MathML, CML and so on) only lends to machine comprehension while not detracting a single iota from human comprehension.

        It's possible to have web content that is both human and machine comprehensible, but it unfortunately takes a little more effort than making content that is just human readable.
  • A long time ago (in a galaxy far away) I kept a playlist of my radio show. I had one page per month. One month I played Porno for Pyros' "Pets" twice. Guess which web page in our department had the highest hit count for the next year...
  • What about !keyword? (Score:3, Informative)

    by Ed Avis (5917) <ed@membled.com> on Tuesday November 20, 2001 @07:22AM (#2588435) Homepage
    I thought we already had this by prefixing keywords with a ! sign. For example, the BSD FAQ [uni-giessen.de] used to have the line:
    Keywords: FAQ 386bsd NetBSD FreeBSD !Linux

    Presumably the same could be done for <meta name="keywords"> in HTML.
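
    A small sketch of how an indexer might read such a line, assuming a leading ! marks a keyword the page does not want to be found under. Both function names and the matching rule are hypothetical; I don't know of any engine that documents this behaviour.

```python
def split_keywords(line):
    """Split a 'Keywords:' line into (positive, negated) keyword lists."""
    head, sep, value = line.partition(":")
    if not sep:
        # No 'Keywords:' prefix, so treat the whole line as the keyword list.
        value = head
    positive, negated = [], []
    for token in value.split():
        if token.startswith("!") and len(token) > 1:
            negated.append(token[1:].lower())
        else:
            positive.append(token.lower())
    return positive, negated


def page_matches(query, positive, negated):
    """Assumed rule: any negated term in the query rules the page out."""
    terms = query.lower().split()
    if any(term in negated for term in terms):
        return False
    return any(term in positive for term in terms)
```

    Fed the line above, `split_keywords` puts linux in the negated list, so under the assumed rule a query for "linux faq" would skip the page while "freebsd faq" would still find it.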

  • In some jurisdictions, you get into trouble if a search engine refers to one of your pages when you enter a trademark (and you are not entitled to use that trademark). This way, you could easily tell search engines not to list your pages when such a trademark is present in the query. Complying with court orders wouldn't be a major problem any more.

    However, you could show some information if people visit with a certain Referrer header, directing them to more useful pages. This works in the majority of cases, and it doesn't need much cooperation from the search engines.
  • 1. When you search on almost any name of European origin, hundreds of genealogy pages show up. Including -genealogy -rootsweb -descendants only works to a certain extent. Many people would be grateful for genealogy page exclusion.

    2. Some sites have menus on each page listing every topic on the site. You search on a word and get every page in the site returned, including those that mention the topic only in the menu. A tag such as this <nonsearchable> </nonsearchable> surrounding the menus might aid in solving this problem.
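
    To make point 2 concrete, here is a sketch of how an indexer could honour such a tag, using Python's standard HTML parser. The `<nonsearchable>` element is the hypothetical tag suggested above, not part of any HTML standard, and the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class IndexableText(HTMLParser):
    """Collect words for indexing, skipping <nonsearchable> regions."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <nonsearchable> elements we are inside
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "nonsearchable":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "nonsearchable" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only text outside every <nonsearchable> region gets indexed.
        if self.depth == 0:
            self.words.extend(data.lower().split())


def indexable_words(html):
    """Return the lowercase words an indexer would keep for this page."""
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return parser.words
```

    A page whose menu is wrapped in the tag would then be indexed only for its body text, so a search on a menu entry would no longer return every page of the site.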

  • Unfortunately, these problems are always better solved by stronger search engines. Even though it is several orders of magnitude harder for a search engine to figure out that those things aren't important, it's several orders of magnitude easier to get google to do it than it is to convince 10 million web page maintainers to do it.
    • stronger search engines

      The more traditional search engines (not Google?) have protections against sites that do extreme things to get to number 1 in the hit list. They have protections against repeating one word many times (META="sex, sex,sex"). Repeating your "exwords" in the normal meta tag that many times should trigger the search engine's "spam alert" and decrease the search relevance.
  • Most web sites don't have meta tags, but most web designers do want their clients to see impressive hit counts in their traffic reports. Ummm, so who thinks web designers are going to take the time and trouble to add a feature that will decrease traffic?
  • there should be a metadata standard allowing webmasters to manually decrease the relevance of their pages for specific search terms and phrases."

    So, in other words... businesses will want to reduce their exposure on the web? I don't think so.

  • This is backwards (Score:2, Insightful)

    by Nick Arnett (39349)
    Picking out the "irrelevant" words is much harder than creating tags that contain the most relevant ones, which is the main point of meta-tags. Most of us have brains that are trained to pick out what is important, not the opposite, so few people would bother to implement this. Language is hard, computers are dumb and few people have been willing to "explain" language to them to make search smarter. In other words, nothing like this works on a significant scale if much effort has to go into it. Tagging important words can be semi-automated with summarization software, which will accomplish much more in terms of relevancy ranking than tagging the ones to ignore. And by the way, this proposal misunderstands robots.txt. The point isn't to conceal the existence of pages, it is to tell *robots*, not people, to stay away from them. (I'm the owner of the mailing list [mccmedia.com] for it.)
  • Many search engines today use very little of the actual text on a webpage for indexing. The "good" ones use the title and the anchor text of the pages that link to a given page as the main scoring features for page relevancy. Only when there are very few hits will a search engine resort to using the actual content text on the page and it is even less likely to use the meta data.

    There were a couple of interesting papers at the ACM's SIGIR [acm.org] this year that use only the anchor text that points to a webpage to get a description of the pointed-to page, and they could do some cool things like language translations with just that data.

  • I know of at least one web page that has been very carefully constructed so that search engines won't find it, but people who know what they're looking for will find it easily.

    With no subject-specific keywords, however, unless you do know what the author is talking about, you won't have any idea what she's so pissed off about.

    No, don't ask: I am routinely pissed off for the same reason, and will not post the URL here.

    I wouldn't mind if searches for my name brought up my current web page, rather than the one I had in 1995. But that's another matter.

    ...laura

    • Why don't you just put spaces in the keywords? Like saying "This page is about S n o o p y." Or if you were ranting in a blog and didn't want to get perv hits you could easily bitch about "That a s s h o l e emailed me yesterday!" etc... I don't expect search engines to ever want to fix that.

      t.

  • Matteo Ricci (he's listed in a bibliography; there is no info to speak of)

    While I have occasionally found a source I needed from a hit on a bibliographic entry, one of my pet-peeves, even on Google, is long lists of nothing but bibliographic entries. Usually it's a pretty clear sign that there isn't much on the topic available on the Internet, but sometimes I just need to change my search terms slightly.

    But I think nonword is a bad idea. If the website's editors decide to keep a word, and Google's page-rank technology shows it to me, I'm willing to check it out.
  • For a search engine at a single site, this is very useful. You watch the queries and results. If a page doesn't show up, but it should, you add the search terms to the keywords. If it shows up, but you don't want it to, what do you do? Create an anti-keyword field.
  • I received the following email message from the CFO of a company called LabBook [labbook.com], about my Bull Shit Markup Language (BSML) web page [catalog.com].

    Apparently, they would prefer that people searching for "BSML [google.com]" did not turn up my web page. I wonder if they've tried to get the Boston School for Modern Languages [bsml.com] to change their name, too?

    Now isn't the whole point of properly using XML and namespaces to disambiguate coincidental name clashes like this? If LabBook thinks there's a problem with more than one language named BSML, then they obviously have no understanding of XML, and aren't qualified to be using it to define any kind of a standard.

    Maybe LabBook should put some meta-tags on their web pages to decrease their relevance when people are searching for "Bull Shit" or "Modern Language".

    -Don

    ========

    From: "Gene Van Slyke" <gene.vanslyke@labbook.com>
    To: <don@toad.com>; <dhopkins@maxis.com>
    Sent: Monday, November 12, 2001 10:36 AM
    Subject: BSML Trademark

    Don,

    While reviewing the internet for uses of BSML, we noted your use of BSML on http://catalog.com/hopkins/text/bsml.html [catalog.com].
    While we find your use humorous, we have registed the BSML name with the United States Patent and Trademark Office and would appreciate you removing the reference to BSML from your website.

    Thanks for your cooperation,

    Gene Van Slyke
    CFO LabBook

    ========

    Here's the page I published years ago at http://catalog.com/hopkins/text/bsml.html [catalog.com]:

    ========

    BSML: Bull Shit Markup Language

    Bull Shit Markup Language is designed to meet the needs of commerce, advertising, and blatant self promotion on the World Wide Web.

    New BSML Markup Tags

    CRONKITE Extension

    This tag marks authoritative text that the reader should believe without question.

    SALE Extension

    This tag marks advertisements for products that are on sale. The browser will do everything it can to bring this to the attention of the user.

    COLORMAP Extension

    This tag allows the html writer complete control over the user's colormap. It supports writing RGB values into the system colormap, plus all the usual crowd pleasers like rotating, flashing, fading and degaussing, as well as changing screen depth and resolution.

    BLINK Extension

    The blinking text tag has been extended to apply to client side image maps, so image regions as well as individual pixels can now be blinked arbitrarily.

    The RAINBOW parameter allows you to specify a sequence of up to 48 colors or image texture maps to apply to the blinking text in sequence.

    The FREQ and PHASE parameters allow you to precisely control the frequency and phase of blinking text. Browsers using Apple's QuickBlink technology or MicroSoft's TrueFlicker can support up to 65536 independently blinking items per page.

    Java applets can be downloaded into the individual blinkers, to blink text and graphics in arbitrarily programmable patterns.

    See the Las Vegas and Times Square home pages for some excellent examples.

    • Oh no, I am quaking in my hip boots, and up to my chin in deep doo doo. A big corporation is trying to claim the rights to BSML, the name of my invention: Bull Shit Markup Language [catalog.com].

      The wheels of government and commerce would grind to a halt were they not well lubricated with Bull Shit. So I created the Bull Shit Markup Language and published the BSML web page [catalog.com] years ago, putting it in the public domain for the good of mankind. Now somebody has finally taken it seriously, and is trying to monopolise BSML!

      He who controls BSML controls the Bull Shit... and he who controls the Bull Shit controls the Universe!

      http://catalog.com/hopkins/text/bsml.html [catalog.com]

      Does anyone know of any prior art pertaining to Bull Shit and Markup Languages? What about VRML -- Maybe I could get Mark Pesce to testify on my behalf? c(-;

      Here's a list of the huge faceless multinational corporations I'm up against:
      http://www.labbook.com [labbook.com]
      "IBM, NetGenics, Apocom, Bristol-Myers Squibb, Wiley and other leaders of the life sciences industry support LabBook's BSML as the standard for biological information".

      To paraphrase Pastor Martin Niemöller:

      First they patented the Anthrax Vaccine
      and I did not speak out
      because I did not have Anthrax.
      Then they patented the AIDS Drugs
      and I did not speak out
      because I did not have AIDS.
      Then they patented Viagra
      and I did not speak out
      because I already had an erection.
      Then they came for the Bull Shitters
      and there was no one left
      to speak out for me.

      -Don
