Google Blogoscoped

Forum

Google's Sergey, Larry Page, Diane Tang, and Vikash were using Yahoo Groups in Dec 1998, before the IPO

Chk the conversation [PersonRank 1]

Thursday, November 23, 2006
16 years ago, 3,804 views

Instructions for using the pageranks:

I have put sample code in the webbase directory.
Take a look at getpagerank.c.
It will work on lilblue since the pageranks are there.
Note the swapshort conversion that converts little endian to big endian.
If you run this on a PC, don't use swapshort.

Also: I have made the data link in the webbase dir automatically point to the
right place depending on what machine you are using.

--Sergey
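
The byte swap mentioned above can be sketched in C roughly as follows. This is only an illustration, not the actual swapshort from getpagerank.c (which is not reproduced in this thread); the name swap_u16 and the assumption that the values are 16 bits wide are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper (not the real swapshort): exchange the two bytes
   of a 16-bit value, turning little-endian data into big-endian and back. */
uint16_t swap_u16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

On a little-endian machine such as a PC the stored bytes already match the host order, which is presumably why the swap should be skipped there.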

Here's the deal:
-- no group has a quota, except for the actual diskspace available. If you want
to know how much disk-space is left, type "df ." at the command-line.
-- there is a filesize limitation on lilblue. No one file can be bigger than
1 GB. However, you can have multiple files, each smaller than 1 GB.
(within the constraint of the diskspace available).

I'll look into freeing up more space on the disks that the group home
directories are on, but you should also be conscious of other groups vying
for the same resources, and remove any files that you no longer have any
need of.

--D
------------------------------------------------
Tuesday, December 1
4:15: asdf
4:30: cmc
4:45: learn
5:00: scriptfinder
5:15: vamp
5:30: vls
5:45: webtek

Thursday, December 3
4:15: arcmakers
4:30: baconbar
4:45: cat5
5:00: etaoin
5:15: lkr
5:30: tigers
5:45: xyyy
6:00: sbs

Hi class,

I have received several questions about process "sleeping".
This is normal behavior when process reaches the end of the webbase.
This is because it is designed to keep running while the webbase is growing.

I'll try to put in a patch to turn that feature off.

Regards,
--Sergey

My apologies for various system problems today.
All kinds of things went wrong.

lilblue should be back up and accepting ssh and telnet (as a backup).

The search server should be back up and running on palo:3491

Various groups have been asking us for various additional resources.
Unfortunately it is difficult for us to meet a lot of these requests
but we will do our best so feel free to ask. Just don't expect a quick
response.

Regards,
--Sergey

I have received lots of notes about problems with trout and lilblue.
I think they are simply not handling all the load well.

I have just made "palo.stanford.edu" available.
It should be quite fast and has all the data available.
Note it is the same endianness as trout so be careful if reusing code from
lilblue.

Your normal login+pass should work. Also there is a dir /disk/a that has a dir
for each group to put data. Note that there are only 4 GB or so available
free there total for all groups.

Soon, I will make alto available as well.

Let me know if you continue to have machine problems.
Also, be sure to back up your source code to another account as trout is not
getting backed up.

--Sergey

Remember, use data dir /home/webbasedata.
Also note that palo, alto, and trout use Intel byte order AKA vax byte order
whereas lilblue uses network byte order.

--Sergey
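
One way to write code that behaves the same on all of these machines is to decode values byte by byte instead of relying on the host's byte order. The sketch below assumes 32-bit fields; read_le32 and read_be32 are hypothetical names, not functions from the webbase code.

```c
#include <stdint.h>

/* Decode a 32-bit value stored in Intel (little-endian) byte order. */
uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode a 32-bit value stored in network (big-endian) byte order. */
uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

Because each function names the order of the data rather than of the machine, the same source gives the same result on palo, alto, trout and lilblue.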

Note: /home/webbasedata should now be your data dir on all machines since that
is a soft link that points to the right place.

--Sergey

------ Forwarded Message

Return-Path: kenta[put at-character here]...
Delivery-Date: Fri Nov 27 15:11:57 1998
Received: from elaine14.Stanford.EDU
(elaine14.Stanford.EDU [171.64.15.79])
by DB.Stanford.EDU (8.8.8/8.8.8) with ESMTP id PAA09602
for <sergey[put at-character here]...>; Fri, 27 Nov 1998 15:11:56 -0800
Received: (from kenta[put at-character here]localhost)
by elaine14.Stanford.EDU (8.8.8/8.8.7) id PAA29158;
Fri, 27 Nov 1998 15:11:56 -0800 (PST)
Date: Fri, 27 Nov 1998 15:11:56 -0800 (PST)
From: Ken Taiyo Takusagawa <kenta[put at-character here]...>
Sender: kenta[put at-character here]...
To: "'Sergey Brin" <sergey[put at-character here]...>
Subject: Re: [cs349] Access to alto now available
In-Reply-To: <199811272301.PAA02746[put at-character here]...>
Message-ID: <Pine.GSO.3.96.981127150722.29048A-100000[put at-character here]...>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

I FTPed the repository.* files over to the leland machines:
elaine10..elaine22 in the dir /tmp/repnon. (someone erased the
contents of elaine13)

I've also found that Perl is quite fast. The following script (on
the 360 MHz elaines) can do 1300 docs/sec.

Ken

#! perl -w
use Compress::Zlib;
$count=0;
while(1) {
  $count++;
  # progress report every 1000 documents
  if ($count %1000==0) {print "$count\n";}
  read STDIN,$s,4;
  # slide forward one byte at a time until the 4-byte record marker is found
  while (!($s eq "\xb8\xd9\x01\x00")) {
    die "$count" unless read STDIN,$f,1;
    $s=substr($s,-3).$f;
    print STDERR ".";
  }
  # the next 4 bytes hold the compressed length (32-bit little endian)
  read STDIN,$s,4;
  $len=unpack("V",$s);
  # print "$len\n";
  read STDIN,$build,$len;
  $x=inflateInit();
  ($output, $status) = $x->inflate($build) ;

  # print $output if $status == Z_OK or $status == Z_STREAM_END ;
}

#last if $status!= Z_OK ;
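
The resynchronization step in the script (sliding forward until the four marker bytes appear) could look roughly like this in C, working on a buffer already read into memory. This is a sketch under assumptions: the marker bytes are taken from the script, while find_marker and the in-memory buffer are hypothetical and not part of the webbase code.

```c
#include <stddef.h>
#include <string.h>

/* The 4-byte record marker the script looks for. */
static const unsigned char MARKER[4] = {0xb8, 0xd9, 0x01, 0x00};

/* Return the offset of the first marker at or after `start`,
   or -1 if no complete marker is found in the buffer. */
long find_marker(const unsigned char *buf, size_t len, size_t start)
{
    size_t i;
    for (i = start; i + 4 <= len; i++)
        if (memcmp(buf + i, MARKER, 4) == 0)
            return (long)i;
    return -1;
}
```

The four bytes after the marker would then hold the compressed record length, as in the script's unpack("V", ...) call.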

If you are not planning on doing your presentation
in HTML, please email me what format your presentation
will be in, and preferably, where I can download it so
that I can get your presentation onto my laptop before
class so that we don't have to spend time during class
trying to download it.

--D

I also created a directory for each group on /trout/b,
which seems to have almost 5 GB of space available.
You can use that for data storage as well.

--D

Hi class. Here are the quiz stats and answers forwarded from Diane:

From: Diane Tang <dtang[put at-character here]...>
Subject: Quizzes
Date: Sat, 28 Nov 1998 17:03:31 -0800

Quiz 1:

Question 1: 4, 9, 5
Question 2: <1, 1>

Average score (among the 27 people who took it): 9.4

Quiz 2:

Support(tea, coffee) = 0.01
Support(tea, jam) = 0.02
Support(coffee, jam) = 0
Confidence(tea->coffee) = 0.1
Confidence(coffee->tea) = 0.05
Confidence(tea->jam) = 0.2
Confidence(jam->tea) = 1
Confidence(coffee->jam) = 0
Confidence(jam->coffee) = 0

Given threshold levels, tea->jam and jam->tea hold, as well as
tea,coffee and tea,jam

With implication threshold, the only thing that holds is jam->tea (implication =
infinity)

Pros of LSI:
-- saves space
-- handles "concepts"

Pros of Wordlist:
-- simple to implement + understand
-- easy to create indices

Average score (among the 41 people who took it): 6.9

You can pick up your quizzes from me during my office hours, or after class
on Tuesday.

--D
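
The Quiz 2 confidence values follow from the usual definition, confidence(A->B) = support(A,B) / support(A). A minimal sketch in C is below. The single-item supports are not given in the answers; tea = 0.1, coffee = 0.2 and jam = 0.02 are values inferred here from the listed numbers, and the function name is hypothetical.

```c
/* confidence(A -> B) = support(A, B) / support(A).
   Returns -1.0 when support(A) is zero and the ratio is undefined. */
double confidence(double support_ab, double support_a)
{
    if (support_a <= 0.0)
        return -1.0;
    return support_ab / support_a;
}
```

With the inferred supports, confidence(0.01, 0.1) gives the 0.1 listed for tea->coffee and confidence(0.02, 0.02) gives the 1 listed for jam->tea.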

Since I have gotten many many emails asking for this,
I have added url2docid.c to the webbase.

It is a sample program that converts urls to docids.
It will work on palo and alto since they have the necessary indices.

--Sergey

Here are the changes:

addition to makefile:

url2docid: url2docid.o urlhash.o debug.o

url2docid.c:

#include <stdio.h>
#include <string.h>   /* strncmp, strcspn */
#include <unistd.h>   /* chdir */
#include "search.h"

FILE *checksfp = NULL;
int numurls;

/* Look up the docid for a url by binary search over the sorted checksum
   file. Returns 0 if the url is not found or the file cannot be read. */
unsigned long url2docid(char *purl)
{
  unsigned int hi,lo;
  struct UrlChecksum cs;
  char *url;
  unsigned int i,j,k;
  int r;

  if (!checksfp) {
    checksfp = fopen(CHECKSTOIDSFN,"r");
    if (!checksfp) {
      LOG(("Couldn't open checks"));
      return 0;
    }
    fseek(checksfp, 0, SEEK_END);
    numurls = ftell(checksfp)/sizeof(struct UrlChecksum);
    LOG(("numurls %d", numurls));
  }

  /* ignore a leading http:// if present */
  if(!strncmp(purl, "http://", 7)) {
    url = purl+7;
  } else { url = purl; }

  hi = checksum(url);
  lo = checksum2(url);

  /* search the half-open range [i, j) so the unsigned indices
     can never wrap below zero or read past the last record */
  i = 0;
  j = numurls;

  while(i<j) {
    k = i + (j-i) / 2;
    /* printf_stderr("trying i=%d j=%d m=%d\n", i,j,k);*/
    r = fseek(checksfp, k*sizeof(struct UrlChecksum), SEEK_SET);
    if(r!= 0) {
      LOG(("Couldn't seek checks"));
      return 0;
    }
    r = fread(&cs, 1, sizeof(struct UrlChecksum), checksfp);
    if(r!= sizeof(struct UrlChecksum)) {
      LOG(("Couldn't read checks"));
      return 0;
    }
    /* LOG(("%u %u %u %u (%u %u %u)\n", hi, lo, cs.hi, cs.lo, i, j, k));*/

    if(cs.hi == hi) {
      if(cs.lo == lo) return cs.docid;
      else if(cs.lo < lo) i = k+1;
      else j = k;
    } else {
      if(cs.hi > hi) j = k;
      else i = k+1;
    }
  }
  return 0;
}

int main()
{
  char url[1024];

  chdir("data");

  while (1) {
    printf("url> ");
    fflush(stdout);
    /* fgets cannot overflow the buffer the way gets can */
    if (!fgets(url, sizeof(url), stdin)) break;
    url[strcspn(url, "\n")] = '\0';   /* drop the trailing newline */
    printf("docid: %lu\n",url2docid(url));
  }
  return 0;
}
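
The lookup in url2docid.c is a binary search over records ordered by two checksums. The same comparison logic can be tried out in memory with a sketch like the one below. The Entry struct and lookup function are hypothetical stand-ins (the real struct UrlChecksum is defined in search.h, which is not shown here), and the ordering by (hi, lo) is assumed from the comparisons in the program.

```c
#include <stddef.h>

/* Hypothetical in-memory record standing in for struct UrlChecksum. */
struct Entry {
    unsigned int hi, lo;
    unsigned long docid;
};

/* Binary search over entries sorted by (hi, lo).
   Returns the matching docid, or 0 if the pair is absent. */
unsigned long lookup(const struct Entry *e, size_t n,
                     unsigned int hi, unsigned int lo)
{
    size_t i = 0, j = n;                 /* half-open range [i, j) */
    while (i < j) {
        size_t k = i + (j - i) / 2;
        if (e[k].hi == hi && e[k].lo == lo)
            return e[k].docid;
        if (e[k].hi < hi || (e[k].hi == hi && e[k].lo < lo))
            i = k + 1;                   /* target is in the upper half */
        else
            j = k;                       /* target is in the lower half */
    }
    return 0;
}
```

As in the original program, a return value of 0 doubles as "not found", so this sketch assumes no real document has docid 0.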

I have modified repository.cc so that process no longer sleeps when it reaches
the end of the repository.

--Sergey

I have created the following directories on alto you can use for disk space:

/disk/a/cs349 /disk/d/cs349 /disk/g/cs349 /disk/j/cs349 /disk/m/cs349
/disk/b/cs349 /disk/e/cs349 /disk/h/cs349 /disk/k/cs349 /disk/o/cs349
/disk/c/cs349 /disk/f/cs349 /disk/i/cs349 /disk/l/cs349 /disk/q/cs349

I recommend you create a subdirectory under one of those directories to do
your work.

You can get to these from either palo or alto by using /rdisk/alto/...

Also you can get to the /disk/a dir on palo using /rdisk/palo/a.

Also note that the webbase data is local to alto so you get better performance
there.

--Sergey

> Well, if I use gcc then I get the following errors because I am using C++
> iostream and fstream:

You should use gcc to compile the .c files, g++ for .cc files, and g++ to link
it all.

--Sergey

>
> /home/vamp/webbase/handlers.cc:261: undefined reference to `endl(ostream&)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to `cerr'
> /home/vamp/webbase/handlers.cc:261: undefined reference to
> `ostream::operator<<(int)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to
> `ostream::operator<<(char const *)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to
> `ostream::operator<<(int)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to
> `ostream::operator<<(char const *)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to
> `ostream::operator<<(char const *)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to
> `ostream::operator<<(char const *)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to
> `ostream::operator<<(int)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to
> `ostream::operator<<(char const *)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to
> `ostream::operator<<(ostream &(*)(ostream &))'
> /home/vamp/webbase/handlers.cc:265: undefined reference to
> `ofstream::ofstream(int, char const *, int, int)'
> /home/vamp/webbase/handlers.cc:266: undefined reference to `ios::operator
> void *(void) const'
> /home/vamp/webbase/handlers.cc:268: undefined reference to
> `ostream::operator<<(char const *)'
> /home/vamp/webbase/handlers.cc:269: undefined reference to
> `fstreambase::close(void)'
>
> -Vikash
>

[cs349] Presentations

So there's still no word on whether or not we'll have a net connection
tomorrow. Someone said that they would try to get that setup this morning,
and I haven't heard from him since.

So there are 3 possible technologies you can use:
-- the overhead projector
-- my laptop:
-- powerpoint
-- Netscape Navigator 4.x (not sure, probably 4.0)
-- your own laptop

If you're planning on using my laptop, please email me where I can download
your presentation so that, just in case we don't have a net connection, you're
not extremely unhappy.

Also, be sure to practice your presentation. 15 minutes is the maximum
time allowed. If you go under, most likely everyone will be happy. If you
go over, that is much less likely to be the case.

--D

Groups presenting tomorrow:
Tuesday, December 1
4:15: asdf
4:30: cmc
4:45: learn
5:00: scriptfinder
5:15: vamp
5:30: vls
5:45: webtek

> Dear Sergey,
>
> > Would it be possible for you to do your presentation as scheduled on Tuesday
> > but to have the project completely finished by the following Sunday?
> We would like very much to be able to do the presentation on
> Tuesday. Unfortunately, it is impossible for us to have the
> necessary statistics available by that day, and the demo will not
> be able to show the effect that we intend to have. In other words,
> we really cannot do the presentation on Tuesday.

We're working to see if we can shift around the presentation schedule.
Unfortunately there are several groups in the same boat as you.
We would prefer if you did your presentation on Tuesday (we will make it
clear that you and several other groups were delayed by machine problems) and
we will give you plenty of time to finish up the project and show us your
results later.

--Sergey

Note that we have noticed that palo and alto have crashed.
We are working to fix them within two hours.

If you have any suspicions about running something that might have caused them
to crash please contact staff before you run it again.

--Sergey

Palo and Alto are back up after a crash unlike any I have ever seen before.

I think that future versions of this class will also be used as the ultimate
stability test for hardware and software.

Lilblue seems to have magically recovered, but I don't recommend you trust it.

I have taken the opportunity to create a lilblue-compatible data directory on
alto in /home/lilbluedata. This means you can use it for your data directory
and repository pointers computed on lilblue will work.

I hope everyone will get their projects running again.

Note that due to all the chaos with the machines, the groups presenting
tomorrow (today I should say) will not be expected to be completely finished
with their projects.

Good luck,
--Sergey

We have networking in the room now. Networking services had to make
an emergency housecall, but all is fine now.

--D

This is a little late for those presenting on Tuesday, I know, so don't
worry about it too much, but do realize this: you only have 15 minutes.
One good rule of thumb is that one slide (especially if you're tired)
takes about 2 minutes to present. If you speak really fast or don't have
a lot on your slide, then maybe a minute and a half. So figure that you
should have about 7-10 slides, max.

If you've already made your slides (and the ones that I've seen *do* look
nice), you might want to go over your presentation and practice, and
see what you can cut out if time runs short. I'll try to bring a watch
and give every group presenting a 5 minute warning if you think that would
be helpful.

With regards to the technology present, I will bring a PC laptop, with
a network connection, Powerpoint, and Netscape. I will also bring a floppy
drive, but I make no guarantees about it working.

If you're presenting on Thursday in Powerpoint, please make sure to
let me know where I can download your slides – things will go much
more smoothly if all the files are on my laptop in advance.

Thanks!
--D

A bunch of people have asked so I found a way to translate docids to urls.

You need to do:
telnet palo 3491
s 1 id:<docid>

and you will get a result line back with a url, title and so forth.

Another way is to:
telnet palo 3491
r foo
<docid>
<docid>
<docid>
0
c

And you will also get various info back.
Probably the latter is better.

--Sergey
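
For anyone scripting the first lookup instead of typing it into telnet, the query line can be built as in the sketch below. The command text comes from the message above; the function name, the buffer handling and the use of a plain newline as the line terminator are assumptions.

```c
#include <stdio.h>

/* Build the "s 1 id:<docid>" query line. Returns the number of
   characters written, or -1 if the buffer is too small. */
int format_docid_query(char *buf, size_t size, unsigned long docid)
{
    int n = snprintf(buf, size, "s 1 id:%lu\n", docid);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

The resulting string would then be written to a socket connected to palo on port 3491; that connection code is left out here.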

Forget Jack Frost! Come to the Tropics.

What:
Larry, Sergey and the rest of the Googles invite you to the first
celebration at Google, Inc.
************ Fri. Dec. 11th 6PM to 10PM *****************
Dress is Tiki Lounge wear and bring something for the hot tub.
Friends are welcome.

Why:
1. Google: The Stanford Research Project is now Google.com: The Next
Generation Internet Search Company. (And we've been in business 90
web years already.)
2. We have great new people in our team.
3. Our alpha at www.google.com is up and running.
4. We plan to IPO next week. (just kidding)
5. We need more people before we can IPO. Resumes welcome.
6. In appreciation of all the great people who have helped us out
over the years.
7. We have a hot tub.

Where:
Google's Temporary World Headquarters
232 Santa Margarita Ave.
Menlo Park CA

Tel (650) 330-0100 or party[put at-character here].... Regrets only.

Directions from 101
Take the Willow exit south toward Menlo Park. Go through several traffic
lights, and take a right at the light at Middlefield. The immediate first
right is Santa Margarita (right after the restaurant). 232 is 2/3 down the
block on the right. Park on the street and walk down the long driveway between
two houses.

http://www.mapblast.com/mapblast/blast.hm?CMD=GEO&xx=1&id=9125909267237&AD2=
232+Santa+Margarita&AD3=Menlo+Park
%2C+Ca&IC=%3A%3A8&IC%3A=Google
+World+Headquarters

[ Please string this together yourself – Sam]

It will be great to see you there!

Due to the problems with the computers, all of the
groups have an extension on the paper deadline until
Sunday, December 6.

There was a breakin last night on palo and alto about 10:00, and trout
around 3AM 10/4. If you typed any passwords in clear text in or out, they
may have been compromised, and you should consider changing them.

The breakin was due to a mountd problem on linux; all the machines I
checked had been broken into (and all my friends' machines too). So if you
have a linux box, try more /.bash_history, and if you see any lkr type
stuff, your machine has been broken into.

The machines seem fine for now, but try to get your projects done soon!

I'll let you know more as I find it out.

-Larry

Due to the security problems, the machines will be going down sometime
after midnight tonight or tomorrow morning. Let me know if that is a
problem for you ASAP.

-Larry

It is possible we will be in a different room tomorrow.
I am awaiting confirmation from the registrar's office.
It will be larger and without the noise (I hope).

If we do change rooms, I will send out an email before the class and
leave a note on the door. The new classroom will be nearby so don't
worry about having to run across campus.

Tomorrow we will start to cover data mining.

Regards,
--Sergey

Class will now be held in Building 370, room 370.
Building 370 is in the Quad, on the side nearest
Gates (sort of "behind" the Math corner if that
helps). Room 370 has an entrance directly from
outside, and is fairly clearly marked.

Reminder: the project proposal for cs349 is due by midnight, this Sunday
(October 18). Please submit the proposal via email in plain-text.

What we expect in the project proposal:
-- what you're trying to accomplish in the project (i.e., goals)
-- why you're trying to accomplish that, what doing whatever you're
doing is a gain over current technology, etc. (i.e., motivation)
-- how you plan on reaching your goals (i.e., initial approach to the problem)
-- when you plan on having what achieved (i.e., schedule, including the
milestone)
-- who's going to be doing what, approximately (i.e., work breakdown)

Not necessarily in this order, format, etc. But that's what I would
recommend. And of course, what you propose, we may not necessarily
agree with, but then again, this is just the proposal. :-)

--D

If you know who's going to be in your group, please send me email
with:
-- a group name
-- who's in the group

and I will create an account for your group. I will send you email
with the group confirmation, and you can get the passwords from me
in class.

--D

To access your account, please

ssh -l <groupname> trout.stanford.edu

ssh is the only way to access the machine.

I've moved all of the group home directories onto disks with space on them,
so you should be able to actually save files and such now.

If you haven't picked up your passwords, please find me before class, during
the break, or after class to pick up your password.

There will be a sign-up sheet in class to make appts to discuss your project
proposal with Serge and Larry next week. Please make sure that your group
signs up.

--D

Here is the current schedule, with approximate email addresses for
each group. If you cannot make your appointment, you have the
following options:
1. Email the group(s) that have the timeslot you want, and see if you
can switch. If the switch occurs, email me as well.
2. Email me, tell me you can't make it, and give me some times that
you could make it (not conflicting with the existing appointments),
and we'll try to work something out with Serge and Larry.

Tuesday:
5:30: scriptfinder (kcsmilak[put at-character here]... ryank)
5:45: etaoin (???)
6:00: Mark Toolis (willey, ??)
6:15: arcmakers (rambler, kaushal, sebbrion)
6:30: lkr (kirchoff, laskin, rokita)
6:45: vamp (vikash, singhala, mpdesai, pdharma)
7:00: vls (ladamic, svemuri, mehta)
7:15: sbs (kstevens, psully, reza)

Thursday:
5:30: asdf (elm, julialee, wesley)
5:45: yisun's group (????)
6:00: cmc (alanwong, tomtong, kenlaw)
6:15: learn (bogdan[put at-character here]playfair, paullo[put at-character here]cs, kuan, kyajima)
6:30: xyyy (xsyang, yuhualiu, yeewah, yxi)
6:45: baconbar (xliang, ms9, dmitrib, dbrussak)
7:00: tigers (onn[put at-character here]cs, ejang, leoraw[put at-character here]cs)

Wants appointment:
vls (ladamic, svemuri, mehta) – can make Thursday, 2:00pm, or
Tuesday/Thursday before class

Groups that I've created accounts for that I don't see on this
schedule:
-- stz (takaoki kenta weizhang)

Groups that I see on the schedule that have yet to email me for an
account:
-- Mark Toolis (willey, ...)
-- etaoin
-- yisun's group

--D

Hi class,

I've finally made a chunk of the webbase
available for you to play with.

Here are instructions:

login to trout

> cd ~
> mkdir webbase
> cd webbase
> cp -d /usr/local/webbase/* .

# note: the -d option copies the "data" symlink that you need

> make

# everything should compile with just a couple warnings

> ./process read cat | less

# you should see lots of web pages

Send any comments/problems to cs349-staff[put at-character here]egroups.com

Good luck.
--Sergey

The first milestone of your project is due next Thursday. The way it's
going to work is that it'll both be a written summary (approximately 1 page),
as well as a discussion with Serge and Larry after class on Thursday.

Please email me the summary, and have a copy printed out to bring to Serge
and Larry on Thursday. We need both.

Secondly, please email me both 3 preferred times AND blocks of time that you
can and cannot make it. The appointments will be from 5:30 to 9:30 in 15
minutes chunks in the pup lab in the basement of Gates. If everyone emails
me saying that they want the 5:30, 5:45 or 6pm slots, that's not very useful,
as that only lets me schedule 3 groups. Please, please send me both your
preferred times AND blocks of time when you could and could not make it so
that I'll have an easier time scheduling.

Thanks!
--D

For meetings on Thursday (your milestone).

We're not scheduling anything at 5:30 to allow for transport time and
questions after class.

If you are on the list as not being scheduled yet, please email me soon!
--D

5:45: sbs (reza, kristian, peter)
6:00: xyyy (yinong, yuhua, xiaosong, yee wah)
6:15: vls (ladamic, svemuri, mehta)
6:30: learn (paul, bogdan, kuan, ken)
6:45:
7:00:
7:15: baconbar (dmitri, matt, daniel, xiaoli)
7:30:
7:45:
8:00:
8:15: vamp (vikash, singhala, mehul, pdharma)

before class (due to flight out of town)
3:45: scriptfinder (ryan, kevin)

Not scheduled yet:
-- arcmakers (rambler, kaushal, sebbrion)
-- lkr (kirchoff, laskin, rokita)
-- asdf (elm, julialee, wesley)
-- cmc (alanwong tomtong kenlaw)
-- cat5 (willey[put at-character here]..., toolis)
-- etaoin (takaoki kenta weizhang)
-- webtek (yisun[put at-character here]cs cchan jlin llo jchen)
-- tigers (onn[put at-character here]cs, leoraw, ejang)

5:45: sbs (reza, kristian, peter)
6:00: xyyy (yinong, yuhua, xiaosong, yee wah)
6:15: vls (ladamic, svemuri, mehta)
6:30: learn (paul, bogdan, kuan, ken)
6:45: lkr (kirchoff, laskin, rokita)
7:00: tigers (onn[put at-character here]cs, leoraw, ejang)
7:15: baconbar (dmitri, matt, daniel, xiaoli)
7:30: cat5 (willey[put at-character here]..., toolis)
7:45: asdf (elm, julialee, wesley)
8:00: arcmakers (rambler, kaushal, sebbrion)
8:15: vamp (vikash, singhala, mehul, pdharma)
8:30: etaoin (takaoki kenta weizhang)

before class (due to flight out of town)
3:30: webtek (yisun[put at-character here]cs cchan jlin llo jchen)
3:45: scriptfinder (ryan, kevin)

Not scheduled yet:
-- cmc (alanwong tomtong kenlaw)

I know that a lot of you have been having problems with compiling. I'm sorry
that that has been happening. The first problem was that /tmp kept getting full
so that there was no room to put intermediate object files. The second problem
(I think) was that there wasn't enough memory available. I have since rebooted,
and since rebooting, I haven't had any difficulties compiling. At some point
this week, I will try to upgrade the compiler and associated libraries and
other files to see if that will help as well.

In general, if you're getting an internal compiler error, that probably means
that there's something external wrong (/tmp, memory, etc.). In the very near
future, we'll be making additional machines available for your use.

In the mean time, try and cope as best you can, and I'll try to figure out why
the compiler and machine are so intermittent. Sorry for the inconvenience!

--D

A lot of people have asked for more information about using the Webbase.
I have put together some helpful examples that show how to create new
handlers for process and how to create parsehandlers.

The code in /usr/local/webbase has been updated. In particular, note that
there are two examples at the end of handlers.cc. You can use
diff /usr/local/webbase <your directory>
to see everything that has changed.

The first is called ACounter and is a subclass of DocHandler. All it does is
count the number of A's in a document. To run it use:
process read http countas | less

here is the code:

/* ACounter below is a very simple example DocHandler that counts the number
   of A's in a document. */

class ACounter: public DocHandler {
public:
  /* The function Handle below gets called for every document. */
  int Handle(Document *doc) {
    char *s;
    int ret = 0;

    if (!doc->body) {
      LOG(("No BODY found; Did you put 'http' in the process command line?"));
      return 0;
    }
    /* count both lower-case and upper-case a's in the body */
    for (s = doc->body; *s; s++)
      if (*s=='a' || *s=='A') ret++;

    printf("%d:%s has %d a's\n",doc->docid,doc->url,ret);
    return 1;
  }
};

The second is called ParseTester and it is a subclass of ParseHandler. It
prints out everything the parser encounters. To run it use:
process read http testparser | less

Here is the code:

/* ParseTester below shows everything the parser parses out. */

class ParseTester: public ParseHandler {
  Document *doc;
public:
  /* This is an example ParseHandler. It demonstrates all the
     capabilities of the parser. If you create your own subclass of
     ParseHandler, you only need to define the functions that you
     need. */

  void NewDocument(Document *d) {
    doc = d;
    printf("New Document: %d: %s\n",doc->docid,doc->url);
  };
  void EndDocument() {printf("End of Document\n");};
  void AddTerm(char *word, int wordlen, int wordtype, int fontsize)
    { printf("Got Term: %s\n", word); };
  void AddNumber(char *word, int wordlen, int wordtype, int fontsize)
    { printf("Got Term: %s\n", word); };
  void AddTitle(char *title, int len) { printf("Got title: %.*s\n",len,title);};
  void AddBaseURL(char *base) { printf("Base URL: %s\n",base);};
  void NewAnchor(char *url, int len) { printf("Got Anchor: %s\n",url); };
  void AddImage(char *url) {printf("Got Image: %s\n",url); };
  void AddAnchorText(char *text, int texttype, int len)
    { printf("Got Anchor Text: %s\n",text); };
  void AnchorDone(char *anchor, int) { printf("Anchor Done\n");};

  /* The functions below have had very limited testing. Use at your
     own risk. */

  void NewForm(char *action) { printf("Got Form: %s\n",action);};
  void NewScript(char *language) { printf("Got Script: %s\n",language);};
  void NewApplet(char *code) { printf("Got Applet: %s\n",code);};
  void NewFrame(char *src) { printf("Got Frame: %s\n",src);};
  void MetaInfo(char *value, char *l) { printf("Got Meta: %s\n", value);};
};

Also note the following addition in process.cc so that these can be called:

else if (strcmp(argv[i],"testparser")==0) {
handlers[nhandlers++] = new HTMLParser(new ParseTester());
if (!httpparser) LOG(("WARNING: No HTTP Parser before ParseTester"));
}
else if (strcmp(argv[i],"countas")==0) {
handlers[nhandlers++] = new ACounter();
if (!httpparser) LOG(("WARNING: No HTTP Parser before ACounter"));
}

We'll let everyone run over the main webbase soon.
That will have functionality like url2docid.

Regards,
--Sergey

PS Also please note the COPYRIGHT file in the source directory.

I have just made another machine available to those who need more horsepower.
It is an IBM with 4 processors and 512M of memory.

It is called lilblue and you should be able to log in with the same account
and pw you use on trout.

You will need to update your code. Namely, Makefile, bigfile.c, and
repository.cc have been updated.
After that:
ln -s /home/webbase/ibmdata ./data
rm *.o # BE CAREFUL TYPING THIS
make process # expect a lot of warnings
./process read http testparser

Note: the IBM development environment is not nearly as nice as Linux but you
have the entire repository available on lilblue and you have a lot of memory
and CPU.

Also, if you don't update repository.cc, then you will see a lot of dots when
you run process.

More to come.

Regards,
--Sergey

We have made some new link index files available on liblue.
To see how to use them look at the new demo program readalink.c

--Sergey

Some students have pointed out a typo in a previous message.
The links indexes are on lilblue (note the second l).

If you use lilblue, don't forget to set your data directory to the one
pointed to by ibmdata.

Also as a number of people have discovered, the webbase code changes directory
to the data directory. So relative paths in your code will be relative to the
data dir.

-------------

We are providing an interface to the backend search server for Google since a
number of groups have requested the functionality found there.

You can use:
telnet palo.stanford.edu 3491

When connected, the following commands work.

d 10 search terms # will return docids of docs matching the search terms
# output lines = docid pagerank relevance phrase/nophrase
s 10 search terms # will produce web like search results but the formatting is tricky.

You can also use:
d 10 link:doc
s 10 link:doc
d 10 flink:doc
s 10 flink:doc

p doc # print the contents of a document

In the above a doc can be a url like www-db.stanford.edu/ (note the trailing /)
or a docid.

If the search server crashes, please send email to cs349-staff[put at-character here]egroups.com.
Also, please don't tax it too heavily.

Cheers,
--Sergey

I do not have milestone reports from the following groups:
cmc
lkr
scriptfinder
vls

I know some of you were asked to do some additional work before
turning in the milestone but it's been a week now. Please email
me your report.

Thanks!
--Diane

Some of you have been asking what is expected of you at the end of
the quarter. First, there is no final. Second, we do in fact expect
you to prepare a presentation and a paper.

The paper should be in the same format as a publishable research
format. For example:
Abstract
Introduction (Background, Motivation, Goals)
Proposed Solution
Results/Discussion (what worked, what didn't work, how well things
worked, etc.)
Conclusion/Discussion/Future Work
optional: Related Work

The paper should preferably be either in Postscript format, or in
HTML, and will be due Friday, December 4, 1998 at midnight, via email
to me. If it's in HTML, email me the HTML and the URL, rather than
just the URL.

The presentations will span class on December 1 and December 3. I'll
arrange for a room with a computer with a net connection and project
equipment. Hopefully, it'll be a PC that supports PowerPoint, but
more details on that will come.

What I would like to know is if people would be willing to have class
extend from 4pm to 6pm on Dec. 1 and Dec. 3 so that 15 minute
presentations will fit in (You'll have a much easier time trying to
present your project in 15 minutes rather than something more
compressed, trust me). If you can, then I'll start setting up the
schedule. Otherwise, we'll have to try to work for some alternate
solution.

--Diane
---------------

---------------

Yes, yes, I know it's down; it crashed in the midst of me trying to
upgrade the compiler. As soon as I find someone with a key to
that office, I'll reboot it.

--Diane

is installed on trout. Should work, but you may have to tweak the
libraries in the makefile.

http://graphics/infrastructure/howto.html

In the interest of fairness rather than
first come first serve, I created a schedule
for group presentations in the following way:
each group was randomly assigned a 1 or a 3
(1 for presenting on Dec 1, 3 for presenting
on Dec 3), and then for the groups on a
particular day, arranged them in alphabetical
order.

If you cannot make the slot assigned to you
(and this will only be applicable if you're
presenting 5:30 or later), then I will swap
your group with another group on the same day.

So here's the schedule:

Tuesday, December 1
4:30: asdf
4:45: cmc
5:00: learn
5:15: sbs
5:30: scriptfinder
5:45: vamp
6:00: vls
6:15: webtek

Thursday, December 3
4:30: arcmakers
4:45: baconbar
5:00: cat5
5:15: etaoin
5:30: lkr
5:45: tigers
6:00: xyyy

Shift everyone up by 15 minutes – for some
reason, I thought class began at 4:30 rather
than at 4:15.

--D

[ Edited two lines to prevent the page formatting breaking – Sam]

David Hetfield [PersonRank 10]

16 years ago #

can someone please save the above in a document and delete this way too long message?

(no offense)

thanks

Sam Davyson [PersonRank 10]

16 years ago #

David I have replicated the post here:

http://sam.davyson.com/blogoscoped/post-one.html

I am not sure that the post however should be deleted. It is very long – but it is not disturbing anyone – it is only making a very long post at the start of a thread – if it was in the middle of another thread that would be a different story.

Personally I have no idea what the post is on about since I have not taken the time to read it.

If any other moderators want to delete the post for length reasons then they can link to the URL above for the full version.

David Hetfield [PersonRank 10]

16 years ago #

ok..
i wouldve deleted everything except the begging

but i agree with you Sam its not really disturbing anyone.. :)

Chk the conversation [PersonRank 1]

16 years ago #

http://groups.yahoo.com/group/cs349/

too long to read this chk above link u will know.if u already know no.
