
U.S. Attorneys data dump, made searchable for researching

by rcs1

Originally posted Tue Mar 20, 2007 at 03:06:05 PM CST

On the ePluribus Media Community, the team has taken the more than 2,000 pages of PDFs that the Department of Justice (DOJ) dumped late Monday night (3-19-07) and made them searchable.

ePluribus Media decided to make these converted PDF documents publicly available so that anyone can easily run text searches on them for relevant information.

Please feel free to make use of these for your researching purposes...post and let us know if this service is helpful to you. (And, of course, we'd be honored if you'd consider a donation to help us continue the work.)

We have set up commentaries to gather the research on specific U.S. Attorneys and White House involvement. Anyone who would like to contribute to them can add info in comments:
Margaret Chiara
Daniel Bogden
Paul Charlton
Carol Lam
David Iglesias
Bud Cummins
Kevin Ryan
John McKay
Patrick Fitzgerald
White House (EOP)

As one should with any conversion of scanned images using optical character recognition, it's best that everyone always check the results against the original files, in this case, those on the House Judiciary site.

  The first batch of PDF links are below the fold:






Part 1-1
Part 1-2
Part 1-3
Part 1-4
Part 1-5
Part 1-6
Part 1-7
Part 1-8
Part 1-9
Part 1-10
Part 1-11
Part 2-1
Part 2-2
Part 2-3
Part 2-4
Part 2-5
Part 2-6
Part 2-7
Part 2-8
Part 2-9
Part 3-1
Part 3-2
Part 3-3
Part 3-4
Part 3-5
Part 3-6
Part 3-7
Part 3-8
Part 3-9
Part 3-10
Part 4-1
Part 4-2
Part 4-3
Part 5-1
Part 6-1
Part 6-2
Part 6-3
Part 6-4
Part 7-1
Part 7-2
Part 7-3
Part 7-4
Part 7-5
Part 7-6
Part 7-7
Part 7-8
Part 7-9
Part 7-10
Part 7-11
Part 8-1
Part 8-2
Part 9-1
Part 10-1
Part 10-2
Part 11-1
Part 11-2
Part 11-3
Part 11-4
Part 11-5
Part 11-6
Part 12-1

In one of Rayne's earlier posts I started going down the avenue of wondering if there could possibly be any connection between USAs' roles on the Attorney General's Advisory Committee and the purge.

In Part 1-2, there's a letter to DAG Paul McNulty from the USAs on the Regional Law Enforcement Information Sharing Working Group of the AGAC.

Is it a coincidence that the following USAs who were either forced out or have the appearance of having been forced out were on that group, or is there a connection?

John McKay
David Iglesias
Carol Lam
Paul Perez
Debra Wong Yang

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
by wanderindiana on Tue Mar 20, 2007 at 04:39:44 PM EST
that same idea... could use your brainpower on it!

by Cho on Tue Mar 20, 2007 at 04:53:00 PM EST
[ Parent ]
Here's a list of other places where people are sifting wheat from chaff.

http://www.tpmmuckraker.com/archives/002809.php
http://editthis.info/tpmmuckraker1/ (wiki)
http://www.docstrangelove.com/gonzopedia/index.php/Main_Page (wiki)

by intranets on Tue Mar 20, 2007 at 08:24:31 PM EST
[ Parent ]

on Dailykos pointing to a diary where he has posted downloadable zip files of the documents.  We can't vouch for the safety or quality of the files, but for those who might want to take a look and are interested in a smaller version to put on their hard drive, it is an alternative.

by standingup on Wed Mar 21, 2007 at 12:51:31 AM EST
[ Parent ]
Or see intranet's comment here:

http://scoop.epluribusmedia.org/comments/2007/3/20/1665/33040/24#24

That comment contains direct links to the much smaller files I posted.

by jukeboxgrad on Wed Mar 21, 2007 at 01:14:10 AM EST
[ Parent ]

this public service announcement has been made available in ORANGe.  Consider recommending to keep it in the public eye.

by Cho on Tue Mar 20, 2007 at 04:56:33 PM EST
The best part I found in Parts 1&2 was the handwritten letter blanked on the front page, but they scanned the back page, and you can almost make out the redacted info.

http://editthis.info/tpmmuckraker1/DAG942

Take a look and try to decipher the notes.

by intranets on Tue Mar 20, 2007 at 05:19:23 PM EST

Take a look
http://farm1.static.flickr.com/169/428204128_8e9e91b9c9_o.jpg

Also, check out Gonzopedia
http://www.docstrangelove.com/gonzopedia/index.php/Main_Page

Which appears to be a wiki effort to parse out good bits from the doc dump.

by intranets on Tue Mar 20, 2007 at 05:31:22 PM EST
[ Parent ]

I'm confused. I thought there were 93 USA.

DOL

2-6 DAG00000254-256

Letter to Senator Mark Pryor (D-AR)
from Richard A. Hertling,
Acting Assistant Attorney General
dated January 31, 2007,
Office of the Assistant Attorney General

p 254

As the Attorney General also stated to you, the Administration is committed to having a Senate-confirmed United States Attorney for all 94 federal districts.

p 255

As you will see, the enclosed information establishes conclusively that the Administration is committed to having a Senate-confirmed United States Attorney for all 94 federal districts.

Cut and paste error? Or am I missing some important detail?

by susie dow on Tue Mar 20, 2007 at 08:07:51 PM EST
Someone was calling Bush out on saying '93 USAs'

I guess (but don't know) that Guam is number 94.

by intranets on Tue Mar 20, 2007 at 08:21:00 PM EST
[ Parent ]

http://www.usdoj.gov/usao/
There are 93 United States Attorneys stationed throughout the United States, Puerto Rico, the Virgin Islands, Guam, and the Northern Mariana Islands.
I think I'm going to start searching the documents just for the phrase:
the Administration is committed to having a Senate-confirmed United States Attorney for all 94 federal districts.
See if there's any other use of it, if it's a talking point, what is its history, etc.

But..what if there is a 94th?

by susie dow on Tue Mar 20, 2007 at 08:26:52 PM EST
[ Parent ]

Susie,

What if there are 94 districts but only 93 USAs (like one USA covers two, or DC relies on DOJ instead?)

The Department of Justice's Office of Justice Programs (OJP) is making a total of $32 million available for awards to Project Safe Neighborhoods Task Forces formed in the 94 U.S. Attorney districts

From Reno's days

Each of the 94 U.S. Attorney's offices
has designated a child support enforcement coordinator.

Maybe we should rely on google for what the real number is:
about 337,000 for "93 U.S. Attorneys"
about 844 for "94 U.S. Attorneys"

Huh!?  Looks like the correct answer is 93.. very odd.

-----------
GAO paid for 94 in 1998

GAO paid out $1B in 2001 for 94 of them,

U.S. ATTORNEYS

SALARIES AND EXPENSES

Appropriations, 2001     $1,247,631,000
Budget estimate, 2002     1,346,289,000
Committee recommendation     1,260,353,000

This account supports the Executive Office for U.S. Attorneys [EOUSA] and the 94 U.S. attorneys offices throughout the United States and its territories. The U.S. attorneys serve as the principal litigators for the U.S. Government for criminal and civil matters.

And in May 2004 was 94...
http://www.gao.gov/new.items/d04422.pdf
(Really good read BTW..  GAO -- "U.S. ATTORNEYS
Performance-Based Initiatives Are Evolving"
)
Might be worth comparing the workload tables and performance in that document to the DOJ tables in the doc dump.

According to ePM journal, there are 94..
Why is google that far off ??  This is a scary memoryhole.

Someone check the 2006 GAO report for US Atty salaries!!!!

by intranets on Tue Mar 20, 2007 at 09:35:30 PM EST
[ Parent ]

...but one is a header line, if you were to lift the current list of USA's off the USDOJ website.  The other 93 are the USA's, including DC, Guam, etc.

Probably a simple boo-boo, unless somebody is thinking about adding a new district and mentally jumping the gun...?

by RayneToday on Tue Mar 20, 2007 at 09:38:33 PM EST
[ Parent ]

Cool I found it in 1999 Congressional testimony.
These 93 U.S. Attorneys serve the nation's 94 federal judicial districts. One U.S.
Attorney serves both the District of Guam and the District of the Northern Mariana


by intranets on Tue Mar 20, 2007 at 09:47:25 PM EST
[ Parent ]
I searched all of the "2's"

Searching phrase: 94 federal districts

Pt 2-07
January 16,2007
Letter: Hertling to Leahy & Feinstein
DAG000000313

by susie dow on Tue Mar 20, 2007 at 08:59:07 PM EST

I do work as a contract attorney reviewing obscene mounds of multilingual documents and coding them according to litigation or Justice Department specs.

I think there might be value in a document review that busts the documents down, document by document if not page by page, and check-box marks them according to traits (e.g. Sampson, Junk, Rationalization, Rove, Congress, Local News, POTUS, Performance Facts, Performance Review, Replacements, "Smoking Gun", etc.)  This would buttress the searching power by creating lists of documents that "really" meant what they were coded (e.g. "Sampson ordered carryout" in an email should be coded differently from "Sampson ordered the firing of ___.")

I put up a diary on Daily Kos along these lines before I knew the great work you guys are doing.  So I would be grateful for any comments, technical wisdom, etc., as well as a sober judgment as to whether that would be a useful exercise now or in the future.

by crablaw on Tue Mar 20, 2007 at 09:04:06 PM EST

I think for the techier people in the community who aren't familiar with document production for litigation  using the term "tagging" might make it easier to describe what needs to be done.

Most folks who are regulars at DailyKos are already familiar with tagging in terms of its application for search across diary content.  In the world of law firms, tagging is generally referred to as coding. (In the tech/IT world, coding means writing programs.  The language barrier can get in the way if we don't clarify this.)

For this document dump, we need to have each document, each page tagged in such a way that we can find "emails" versus "draft testimony", and emails that are "From Sampson" versus "From Goodling", and so on.

Does that make sense to volunteers?  I hope so!
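
For the techier volunteers, here is a rough Python sketch of what that kind of tagging could look like. Everything in it is an assumption for illustration: the `PageTag` fields, the `find` helper, the `EXAMPLE-...` identifiers and the tag values are all made up, and none of it describes a tool ePM or any law firm actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class PageTag:
    """One tagged page or document (hypothetical record layout)."""
    doc_id: str                 # placeholder identifier, not a real page number
    doc_type: str               # e.g. "email" or "draft testimony"
    sender: str = ""            # e.g. "Sampson"; empty when not applicable
    tags: set = field(default_factory=set)


def find(pages, doc_type=None, sender=None, tag=None):
    """Return the pages that match every criterion that was supplied."""
    results = []
    for page in pages:
        if doc_type is not None and page.doc_type != doc_type:
            continue
        if sender is not None and page.sender != sender:
            continue
        if tag is not None and tag not in page.tags:
            continue
        results.append(page)
    return results


# Made-up sample records, only to show the shape of the data.
pages = [
    PageTag("EXAMPLE-0001", "email", "Sampson", {"Replacements"}),
    PageTag("EXAMPLE-0002", "email", "Goodling", {"Junk"}),
    PageTag("EXAMPLE-0003", "draft testimony", "", {"Congress"}),
]

emails_from_sampson = find(pages, doc_type="email", sender="Sampson")
```

The point of keeping the document type, the sender and the free-form tags in separate fields is that "emails" versus "draft testimony" and "From Sampson" versus "From Goodling" can then be filtered independently.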

by RayneToday on Tue Mar 20, 2007 at 10:05:13 PM EST
[ Parent ]

Indeed - document review is sometimes nothing more than tagging consistently boring content under a formal structure, with free coffee.  But great point.

by crablaw on Wed Mar 21, 2007 at 10:06:25 PM EST
[ Parent ]
Excellent organizational idea...something for us all to consider as we catch our breath..More documents getting dumped tonight I hear.

Welcome!

by Cho on Tue Mar 20, 2007 at 09:53:57 PM EST
[ Parent ]

on this mighty effort.  You're doing the Republic's work.

by Captain Future on Tue Mar 20, 2007 at 10:21:54 PM EST
How are you doing!  We've missed you around here lately and your excellent commentaries!  Good to see your User ID.

by Cho on Tue Mar 20, 2007 at 10:52:21 PM EST
[ Parent ]
There's a way to make these files about 90% smaller. I've done so. See here: http://www.dailykos.com/story/2007/3/21/03859/5511

by jukeboxgrad on Tue Mar 20, 2007 at 11:57:11 PM EST
I believe the ePM conversion includes the original tif (with text behind it) so that is why it is so much bigger (i think).

Do you give approval to host your pdf here?

Nice job.

by intranets on Wed Mar 21, 2007 at 12:10:01 AM EST
[ Parent ]

I think you're exactly right about the tiff. As far as I can tell, it doesn't add anything and just makes the files much harder to handle.

Yes, of course, I'd be delighted if the files were hosted here. As far as I'm concerned, anyone who would like to host them should grab them and host them.

by jukeboxgrad on Wed Mar 21, 2007 at 12:15:16 AM EST
[ Parent ]

Would you mind putting up the link to the files in a comment here?

by roxy317 on Wed Mar 21, 2007 at 12:13:13 AM EST
[ Parent ]
There are actually three different links, because I provide the material in 3 different forms. That's explained in the diary: http://www.dailykos.com/story/2007/3/21/03859/5511

Feel free to grab the files and host them here. But the text in the diary is helpful because it explains how the 3 links provide the data in different forms. Move that text here, if you like. Whatever it takes to get the material out there so people can use it.

by jukeboxgrad on Wed Mar 21, 2007 at 12:18:33 AM EST
[ Parent ]

Instead of making people wait, I put up direct links.  They may not work if it is cookie based, but download them and host on ePM.

A) DOJ consolidated.txt.zip is a 1.5mb zip file. When unzipped, this yields a text file of 5mb. This text file contains all the text from the original 61 files. Just no graphics. That's why the file is small.

B) DOJ consolidated.pdf is 27.8mb. This is a single text-searchable pdf that contains a compilation of all 61 files.

C) DOJ individual files.zip is a 26.5mb zip file. When unzipped, this yields a folder containing 61 text-searchable pdfs. This approach corresponds to the way the material was originally packaged. The folder size is 29.3 mb.

by intranets on Wed Mar 21, 2007 at 12:28:00 AM EST
[ Parent ]

That should work.

by jukeboxgrad on Wed Mar 21, 2007 at 12:31:25 AM EST
[ Parent ]
I had to un-arm the javascript make-you-wait stuff to point to direct file.  Roxy, you can either use their bandwidth until the file expires or upload here and update linkys.

by intranets on Wed Mar 21, 2007 at 12:34:38 AM EST
[ Parent ]
Very clever. How did you do that?

I paid them ten bucks, thinking it would mean no one would have to wait. On account of what you wrote, I figured out that the ten bucks only means that I don't have to wait.

So the direct links you're using are very clever, since they cut out the wait. So I tried editing my diary to use those links instead. But in that context, they still triggered the wait, even though they don't trigger the wait when I click on them in your comment here.

I think you know why. Something to do with cookies? I appreciate the chance to learn something.

"you can either use their bandwidth until the file expires"

I think the deal there is they'll host the files indefinitely. But I'm perfectly happy if you just host them here instead.


by jukeboxgrad on Wed Mar 21, 2007 at 01:07:39 AM EST
[ Parent ]

I'm not 100% sure my direct link will work. Usually those sites make sure you can work around their wait to download thing.  (You usually have to pay to not wait)

I looked at the source and the click to download after you wait is a javascript: with the actual hard link.  They usually don't allow you to download from there (there are http referrer things and cookies)  since the link is some huge alphanumeric thing it probably is cookie based... I tried those links from IE instead of FF (not a cookie issue) and it worked.. But that link may only work for my IP?

Anyways if that link works for folks here it should work in dkos.. maybe the URL was too long or didn't get fully copied?  Or else you HAVE to click and wait..

by intranets on Wed Mar 21, 2007 at 01:27:47 AM EST
[ Parent ]

So the best would be to upload to some of those file hosting places then find the real link, then replace with coral cache.. so as long as site is up, you are actually downloading from nyud.net

Like this

by intranets on Wed Mar 21, 2007 at 01:38:27 AM EST
[ Parent ]

At this point I've got direct (undelayed) links in the diary, and they seem to work OK. I'm not sure why I was having trouble with that before. Maybe something was stuck in a cache somewhere. I rebooted in-between, for other reasons.

Anyway, thanks for helping me learn stuff.

by jukeboxgrad on Wed Mar 21, 2007 at 03:01:58 AM EST
[ Parent ]

This comment has been deleted by intranets



by intranets on Wed Mar 21, 2007 at 02:08:57 AM EST
[ Parent ]
I uploaded a zip of all 61 of the above .pdf

No direct link you have to go here:
http://www.filefactory.com/file/08a808/
Click on "Basic Download"
Type in four characters
Download DOJDocPt1-12_epm.7z (140MB)

(if you don't have 7-zip it's free)

by intranets on Wed Mar 21, 2007 at 03:19:02 AM EST
[ Parent ]

Roxy, one last thing: should there be a note at the bottom that says, obviously, the OCR text conversion may have some errors?

And.. I'd recommend replacing all the links above with coral cached versions to save a huge $$$ bill on hosting this month.

(you add  --> .nyud.net:8080)

http://www.epluribusmedia.org/documents/DOJDocsPt12-1070319_epm.pdf
becomes
http://www.epluribusmedia.org.nyud.net:8080/documents/DOJDocsPt12-1070319_epm.pdf
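
If someone wants to rewrite all 61 links in one go instead of by hand, a small Python sketch along these lines might do it. It only assumes the rule shown above (append `.nyud.net:8080` to the host name); the function name `coralize` is made up, and any port already present in a URL is simply dropped.

```python
from urllib.parse import urlsplit, urlunsplit

# Suffix taken from the example above; everything else here is illustrative.
CORAL_SUFFIX = ".nyud.net:8080"


def coralize(url):
    """Return the URL with the cache suffix appended to its host name."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Leave the URL alone if it has no host or is already rewritten.
    if not host or host.endswith(".nyud.net"):
        return url
    return urlunsplit(
        (parts.scheme, host + CORAL_SUFFIX, parts.path, parts.query, parts.fragment)
    )


original = "http://www.epluribusmedia.org/documents/DOJDocsPt12-1070319_epm.pdf"
cached = coralize(original)
```

Running the same function over an already rewritten link returns it unchanged, so it should be safe to apply to a mixed list.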

by intranets on Wed Mar 21, 2007 at 12:43:47 AM EST

Roxy, both your diary and this article lack updates indicating that an alternate version is now available (which people might find helpful for reasons that I explain in my diary). It's not ideal that people find out only by scrolling through comments. And folks who might like to help could be discouraged by the idea that they have to click 61 separate links and download 200mb of material. They wouldn't know that's not necessary unless they scroll down and read the comments.

Just a suggestion.


by jukeboxgrad on Wed Mar 21, 2007 at 09:04:45 AM EST

I am on a dial-up...... So I am particularly wary of using and downloading alternative software......... I'll personally stick with the tried and true.

by avahome on Wed Mar 21, 2007 at 01:32:39 PM EST
[ Parent ]
Thank you for the input.  You are more than welcome to post a commentary with the information on the files that you have made available as an alternative.  We made decisions internally for the way that we chose to post the data.  We understand you are not aware of those and may not have considered that before you made your comment.  

by standingup on Wed Mar 21, 2007 at 01:38:51 PM EST
[ Parent ]
and all your help!

Quick explanation: We wanted to:

  1. "preserve the original documents" with all of the handwritten notes, images, etc.
  2. provide the ability to always read the original scan if there are suspect characters, which you can do with the original tiff behind the text overlay
  3. provide a much higher percentage of "correct" character recognition, which you get
     by retaining a high resolution prior to the text overlay

Some of the research involves (check out CoachMcGuirk's commentary over on the right), being able to see as clearly as possible the original.

Roxy's been doing this for years for the San Jose Public Library; Nike; University of New Mexico; Pentacle Press, taking legacy docs and out of print books into text-searchable PDF for online delivery; Powells; and has been the consultant to HP and FIERO on their scan engines and OCR;
Johns Hopkins Medical, and Sun Micro Systems among others...

Thanks again, for all your help; it's been tremendous!  


by Cho on Wed Mar 21, 2007 at 01:56:43 PM EST
[ Parent ]

Thanks to all of you for your kind responses.

"We wanted to: ... provide a much higher percentage of 'correct' character recognition, which you get by retaining a high resolution prior to the text overlay"

I understand that's what you wanted, but it's not what you got. In fact, it's the opposite of what you got. Upon examination I quickly notice that your files have a much lower percentage of 'correct' character recognition (compared with the alternate version). I could give you a very long list of examples (which I derived by using a text-comparison program), but at the moment I'll just point out one very glaring and troublesome example.

Please open your first file (DOJDocsPt1-1070319_epm.pdf), and look at the bottom of the first page. You'll see this page number: DAG000000425. Double-click on it to select it. You'll find that you can't. That's because this text was not recognized as text by your OCR program.

Here's another way you can prove that text wasn't recognized: try searching for it. Type the page number ("DAG000000425") into Adobe Reader's "Find" field. Reader will tell you "no matches were found."

Is this a problem only on the first page? Definitely not. Try searching for this string: "DAG0000". Notice how many are found: 2. This despite the fact that page numbers beginning "DAG0000" appear on almost every single one of the 49 pages in this particular file.

Is this a problem only in the first file? Definitely not. Try searching for the string "DAG0000" in all your files (Adobe Reader's "Search" feature allows you to search all pdfs in a given folder). This is how many you'll find: 39. This is how many you should find: approximately 1557. How do I know? Because that's how many you'll find if you do the exact same search in the 28mb consolidated pdf (which I've referred to as "the alternate version").

(Incidentally, the search that finds 1557 items is about 20 times faster than the search which finds 39 items. This is the opposite of what one would expect, but it works this way because for technical reasons that are not hard to imagine, Acrobat Reader can search much more efficiently in one file that's 28mb in size, as compared with a folder of 61 files that have an aggregate size about 7 times greater.)

(Also, the number is 1557 instead of 3000 mostly because the prefix "DAG" is not always used. For example, another common page-number prefix is "OAG.")

In other words, your OCR program ignored roughly 97% of the page numbers (which the program used to create the alternate file did not ignore). Page numbers are very important, for obvious reasons. What else did your program ignore? I don't know, but if I spent more time analyzing your files I could probably prove the answer is a lot.
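
For anyone who wants to check the arithmetic, or repeat the count on text extracted from either version, here is a minimal Python sketch. It assumes the text has already been pulled out of the pdfs into plain strings; the helper names `count_hits` and `missed_share` are made up for illustration, and the only real numbers used are the 39 and 1557 quoted above.

```python
def count_hits(text, needle):
    """Count case-insensitive occurrences of needle in already-extracted text."""
    return text.lower().count(needle.lower())


def missed_share(found, expected):
    """Fraction of expected hits that a search failed to find."""
    return 1 - found / expected


# Counts quoted above: 39 hits in the ePM files, about 1557 in the alternate pdf.
share = missed_share(39, 1557)
print(f"{share:.1%}")
```

The same two helpers can be pointed at any other search string, which makes it easy to compare the two versions term by term.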

"We wanted to: ... 'preserve the original documents' with all of the handwritten notes, images, etc."

The alternate pdf also preserves most of that material. In a handful of places (I estimate less than 1% of the pages) where handwriting is not well-preserved in the alternate pdf, I think the original files do a nice job of filling that very small gap.

More importantly, the alternate pdf contains much more accurate text (which is obviously important because these documents are mostly text). This is aside from the fact that it's 86% smaller (and is correspondingly that much quicker to download and move around), and also enormously more manageable by virtue of being one file instead of 61.

Other than that, your approach is obviously preferable.

By the way, here's one indication the alternate pdf is working well: it's been downloaded over 300 times since it was first posted less than a day ago, and no complaints have been posted.

I have some other comments but this is all I have time for at the moment.

by jukeboxgrad on Wed Mar 21, 2007 at 10:16:41 PM EST
[ Parent ]

cho: "provide the ability to always read the original scan if there are suspect characters, which you can do with the original tiff behind the text overlay"

This brings up another serious problem with your files. Aside from a very high error rate, a major compounding problem is that "the text overlay" is invisible (on many pages it's also badly misaligned, which makes copying text a very awkward procedure). That means it's nearly impossible to detect the presence of "suspect characters," unless you use somewhat exotic techniques. (This is fundamentally different than the situation in the alternate pdf, where errors in the scanning are almost always immediately obvious to the naked eye.)

This is easily understood with an example. Use Adobe Reader to search your folder of ePM pdfs. Search for "protcgt". Yes, "protcgt". Adobe reader will then take you to page 44 of your file DOJDocsPt2-7070319_epm.pdf. Of course, you won't see "protcgt," because it's hidden in an invisible text overlay. What you'll see is a poorly aligned rectangle in the general vicinity of the word "protege." In other words, your OCR program thinks that "protege" is spelled "protcgt" (in a cubicle somewhere is a programmer who enjoys covertly acting out his hostility for the French by keeping French-sounding words out of an OCR dictionary). So you'll never find this instance of the word "protege" by searching, unless you realize that the proper way to spell it is "protcgt." And you'll most likely never realize this, and you'll never know what you're missing, because "protcgt" is invisible.

Here's another example of the exact same phenomenon. Do another search in your folder of 61 pdfs, this time looking for the string "karl row." Adobe Reader will take you to page 5 of your file DOJDocsPt3-7070319_epm.pdf. You'll see that misaligned highlight again, but you won't see "karl row." You'll see "Karl Rove."

Is it possible that someone might want to find all the places where Rove is mentioned? Yes, but that search won't be successful, and you won't know that it wasn't successful, unless you know the right way to misspell "Rove."

Is it possible that "Rove" is sometimes scanned incorrectly in the alternate pdf? Yes, but it's less likely, because the overall error rate is much lower. And if there is such an error, a reader will have at least a fighting chance of being able to spot it, because the error is not hidden in an invisible text overlay.

Another interesting example. If you search in the alternate pdf for the string "sampson," you'll get 1149 instances. If you search in the ePM files, you'll only get 946. What happened to the other 103 "sampsons?" Your program decided to liven things up by secretly using alternate spellings, in those instances. Here's one alternate spelling your program likes a lot: "Sarnpson." Here's how many times your program thought that was the right way to spell "Sampson:" 75.

Does that problem ever crop up in the alternate pdf? Yes: twice.

In a very similar manner, you're missing about 25% of the occurrences of "iglesias" and "chiara." You're missing about 30 instances of "carol lam." You're missing "charlton" about 35 times. Names are an acid test for OCR because the program doesn't get to rely on a dictionary. But obviously names are crucial in a project like this. Your program is getting names wrong roughly 10-25% of the time. This error rate is consistently much lower in the alternate pdf. After trying many names, I can't find a single instance where the count isn't lower in the ePM files.
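
One way to work around this kind of hidden misspelling is to search with a pattern that tolerates the known misreadings. The Python sketch below is only that, a sketch: the confusion table is seeded solely from the three examples above ("Sarnpson", "karl row", "protcgt"), the names `CONFUSIONS` and `fuzzy_pattern` are made up, and it assumes the text has already been extracted from the pdfs. It will not catch misreadings that are not in the table.

```python
import re

# Hypothetical confusion table built only from the misreadings quoted above:
# "m" read as "rn", "ve" read as "w", "e" read as "c" or "t".
CONFUSIONS = {
    "ve": ["w"],
    "m": ["rn"],
    "e": ["c", "t"],
}


def fuzzy_pattern(term):
    """Build a case-insensitive regex that also accepts the listed misreadings."""
    keys = sorted(CONFUSIONS, key=len, reverse=True)  # try longer keys first
    lowered = term.lower()
    pieces = []
    i = 0
    while i < len(lowered):
        for key in keys:
            if lowered.startswith(key, i):
                options = [key] + CONFUSIONS[key]
                pieces.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
                i += len(key)
                break
        else:
            # No confusion known for this character; match it literally.
            pieces.append(re.escape(lowered[i]))
            i += 1
    return re.compile("".join(pieces), re.IGNORECASE)


sampson = fuzzy_pattern("sampson")
rove = fuzzy_pattern("Karl Rove")
protege = fuzzy_pattern("protege")
```

Searching with `sampson` would then find both the correct spelling and "Sarnpson", at the cost of occasional false positives as the table grows.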

By the way, I agree that it's a good idea to have "the ability to always read the original scan if there are suspect characters." This is very easily accomplished by opening the original files. Along with the alternative pdf, which is a compilation, I also created a batch of 61 small files that correspond to the original files, specifically as a cross-reference for the purpose of making it easy to quickly locate any original page. I'd be glad to help anyone who wants to understand how to do this (there's also some explanation about this in the diary where all these files are hosted). The steps are pretty simple. My email address is posted at dKos.

by jukeboxgrad on Thu Mar 22, 2007 at 12:02:08 AM EST
[ Parent ]

Look, OCR can give different results.  Your OCR has errors in some areas that differ.  It's not about which one is better or trying to ignore your efforts.  

I would recommend people check both versions if they have a specific name or word search.  

But roxy has contributed a lot to this site and publications and spent a lot of time converting these pdfs.  So quit being so critical here, I could point out tons of errors on butchering in your OCR.  It's just an image processing algorithm.

As you said,

I shouldn't exaggerate. I notice the much larger ePM files include some things that are dropped in my files, like certain signatures. I guess the bottom-line is that OCR isn't perfect, and sometimes it's necessary to look back at the original files.

I agree a single pdf is a good idea and probably will eventually get around to it, and the whole archive is only 7 times bigger and has original scans.

It's not competition, but a public service, just like what you did.

by intranets on Thu Mar 22, 2007 at 12:19:05 AM EST
[ Parent ]

:)

by roxy317 on Thu Mar 22, 2007 at 11:30:28 AM EST
[ Parent ]
"Your OCR has errors in some areas that differ ...  I could point out tons of errors on butchering in your OCR."

So could I. Currently, OCR can't be done without errors (given source material like this). But it's obviously desirable to minimize the errors. It's simply a fact that the ePM files have a dramatically higher error rate. I've demonstrated that in detail, and there's a ton more proof that I haven't gone to the trouble of presenting. If you can point out where my analysis is flawed, I would appreciate the chance to learn something.

This doesn't mean that ePM is a bad organization or that Roxy is a bad person. It just means that a lot of people (potentially 100% of the people who read Roxy's diary and/or article without bothering to scroll down into comments) are going to be tripping over errors for no good reason.

"It's not about which one is better ... "

It's not about which person is better, but it is about which tool is better. We should maximize our precious collective efforts by using the best possible tools, without regard to where they came from or who gets the credit. Once everyone does their best to throw something in the pile, we should pick what works best and forget where it came from. Where it really came from is everyone.

" ... or trying to ignore your efforts."

I think you would be surprised if you knew how little I care about whether or not I'm ignored. Likewise, I think you would be surprised if you knew how much I care about the importance of this particular collective effort.

"But roxy has contributed a lot to this site and publications and spent a lot of time converting these pdfs."

I'm sure Roxy is an outstanding individual. And for obvious reasons I'm in a position to have some insight into how much work she put into this. But the issue is not me or Roxy. The issue is delivering the best possible tools in order to maximize the likelihood of a favorable outcome. There's no question that Roxy worked hard, and there's no question that she deserves all sorts of credit for that. Giving her that credit is a separate matter from making smart decisions now about how to help our community be as effective in this effort as possible.

"I notice the much larger ePM files include some things that are dropped in my files, like certain signatures."

The ePM files have an advantage in one area: handwriting. But the handwriting is obviously not searchable, anyway, and it's portrayed perfectly in the original files. And it appears on less than 1% of the pages. Therefore in my opinion the simplest way to deal with the 30 or so pages that contain potentially interesting handwriting is to eyeball them in the original files.

"I agree a single pdf is a good idea and probably will eventually get around to it"

It's a good idea provided it's practical in size and not filled with errors.

For my own convenience and testing purposes, I already converted the ePM files into a single pdf. It's done with a simple command if you have a full version of the Adobe Acrobat program (having 2gb of ram also helps a lot). The result is a file that's 154mb. That's better than the original 204mb folder, but it's still too big. By offering files so large, you're effectively limiting the pool of people who can participate, because only the newest machines can do a very good job of working with such giant files. Likewise, you're also telling people with dialup that they can't play unless they want to tie up their phone line for 10 hours or more. For obvious reasons, it's a good idea to make it possible for a large number of people to participate.
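
To put rough numbers on the dialup point, here is a back-of-the-envelope Python sketch. It is built on assumptions that are not from the files themselves: a nominal 56 kbit/s modem, 1 mb counted as 1,000,000 bytes, and a made-up `efficiency` factor standing in for real-world line quality. The function name `download_hours` is also just for illustration.

```python
def download_hours(size_mb, kbps=56.0, efficiency=1.0):
    """Estimate download time in hours for a file of size_mb megabytes.

    kbps is the nominal line speed in kilobits per second; efficiency is an
    assumed fraction (0 to 1) of that speed actually achieved.
    """
    bits = size_mb * 8_000_000
    seconds = bits / (kbps * 1000 * efficiency)
    return seconds / 3600


# Sizes quoted in this thread: 204mb folder, 154mb combined file, 28mb alternate pdf.
for size in (204, 154, 28):
    print(size, "mb:", round(download_hours(size, efficiency=0.8), 1), "hours")
```

Under these assumptions the 204mb folder comes out at a little over 10 hours at 80% efficiency, and the 28mb file at well under two.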

Aside from the size, that 154mb file obviously contains 100% of the errors that are in the 61 separate files. In my opinion, the biggest problem is not the overall size or the fact that it's painful to juggle 61 separate files. The problem is the error rate.

by jukeboxgrad on Thu Mar 22, 2007 at 03:03:08 AM EST
[ Parent ]
