Should card layouts have lengthy titles? - responsive-design

I need a supporting UX document arguing against the use of lengthy titles in card layouts.


Can OCR software reliably read values from a table?

Would OCR Software be able to reliably translate an image such as the following into a list of values?
In more detail the task is as follows:
We have a client application, where the user can open a report. This report contains a table of values.
But not every report looks the same - different fonts, different spacing, different colors, maybe the report contains many tables with different number of rows/columns...
Using the mouse, the user selects an area of the report which contains a table.
Now we want to convert the selected table into values - using our OCR tool.
When the user selects the rectangular area I can ask for extra information
to help with the OCR process, and ask for confirmation that the values have been correctly recognised.
It will initially be an experimental project, and therefore most likely with an OpenSource OCR tool - or at least one that does not cost any money for experimental purposes.
The simple answer is YES; you just have to choose the right tools.
I don't know if open source can ever get close to 100% accuracy on those images, but based on the answers here it probably can, if you spend some time on training and solve the table analysis problem and the like.
When we talk about commercial OCR like ABBYY or others, it will give you 99%+ accuracy out of the box and will detect tables automatically. No training, nothing; it just works. The drawback is that you have to pay for it. Some would object that with open source you pay with your time to set it up and maintain it, but everyone decides that for himself.
If we talk about commercial tools, there is actually more choice, and it depends on what you want. Boxed products like FineReader are aimed at converting input documents into editable documents like Word or Excel. Since you actually want to get data, not a Word document, you may need to look into a different product category: Data Capture, which is essentially OCR plus some additional logic to find the necessary data on the page. In the case of an invoice, that could be the company name, total amount, due date, line items in the table, etc.
Data Capture is a complicated subject and requires some learning, but properly used it can give guaranteed accuracy when capturing data from documents. It uses different rules for data cross-checks, database lookups, etc. When necessary it may send data for manual verification. Enterprises widely use Data Capture applications to enter millions of documents every month and rely heavily on the extracted data in their everyday workflows.
And there are also OCR SDKs, of course, which give you API access to the recognition results so that you can program what to do with the data.
If you describe your task in more detail I can provide you with advice what direction is easier to go.
So what you are describing is basically a Data Capture application, but not a fully automated one, using the so-called "click to index" approach. There are a number of applications like that on the market: you scan images, an operator clicks on the text in the image (or draws a rectangle around it), and then populates fields in a database. It is a good approach when the number of images to process is relatively small and the manual workload is not big enough to justify the cost of a fully automated application (yes, there are fully automated systems that can handle images with different fonts, spacing, layouts, numbers of rows in the tables and so on).
If you decide to develop rather than buy, then all you need here is to choose an OCR SDK. You are going to write all the UI yourself, right? The big decision is: open source or commercial.
The best open-source option is Tesseract OCR, as far as I know. It is free, but it may have real problems with table analysis; with a manual zoning approach, though, that should not be a problem. As for OCR accuracy, people often train the OCR on a font to increase accuracy, but that should not apply in your case, since the fonts can differ. So just try Tesseract out and see what accuracy you get; that will determine the amount of manual correction work.
Commercial OCR will give higher accuracy but will cost you money. I think you should take a look anyway, to see whether it is worth it or Tesseract is good enough for you. The simplest way would be to download a trial version of some boxed OCR product like FineReader; that will give you a good idea of what accuracy an OCR SDK would achieve.
If you always have solid borders in your table, you can try this solution:
Locate the horizontal and vertical lines on each page (long runs of black pixels)
Segment the image into cells using the line coordinates
Clean up each cell (remove borders, threshold to black and white)
Perform OCR on each cell
Assemble results into a 2D array
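The steps above can be sketched in miniature. The toy code below works on a nested-list binary image (1 = black pixel) and covers only the line-location and cell-segmentation steps; a real implementation would use OpenCV for the image handling and Tesseract for the per-cell OCR, and all names here are illustrative.

```python
# Toy illustration of the bordered-table pipeline: locate ruling lines,
# then pair consecutive lines into cell boxes. Per-cell cleanup and OCR
# would be done by a real library (e.g. Tesseract) on each box.

def find_lines(img, axis, threshold=0.9):
    """Indices of rows (axis=0) or columns (axis=1) that are mostly
    black, i.e. candidate ruling lines."""
    h, w = len(img), len(img[0])
    if axis == 0:
        return [y for y in range(h) if sum(img[y]) >= threshold * w]
    return [x for x in range(w) if sum(row[x] for row in img) >= threshold * h]

def cells_from_lines(h_lines, v_lines):
    """Pair consecutive lines into (top, bottom, left, right) cell boxes."""
    return [(t, b, l, r)
            for t, b in zip(h_lines, h_lines[1:])
            for l, r in zip(v_lines, v_lines[1:])]

# 7x7 image with ruling lines at rows/cols 0, 3 and 6: a 2x2 grid.
img = [[1] * 7 if y in (0, 3, 6) else
       [1 if x in (0, 3, 6) else 0 for x in range(7)]
       for y in range(7)]

h_lines = find_lines(img, axis=0)           # [0, 3, 6]
v_lines = find_lines(img, axis=1)           # [0, 3, 6]
cells = cells_from_lines(h_lines, v_lines)  # 4 cell boxes
```

Each box would then be cropped, thresholded, and passed to the OCR engine individually, with the results assembled into a 2D array keyed by the box's grid position.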
If instead your document has a borderless table, you can try the following approach:
Optical Character Recognition is pretty amazing stuff, but it isn't always perfect. To get the best possible results, it helps to use the cleanest input you can. In my initial experiments, I found that performing OCR on the entire document actually worked pretty well as long as I removed the cell borders (long horizontal and vertical lines). However, the software compressed all whitespace into a single empty space. Since my input documents had multiple columns with several words in each column, the cell boundaries were getting lost. Retaining the relationship between cells was very important, so one possible solution was to draw a unique character, like "^", on each cell boundary: something the OCR would still recognize and that I could use later to split the resulting strings.
I found all this information via this link, by googling "OCR to table". The author published a full algorithm using Python and Tesseract, both open-source solutions!
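The "^" sentinel trick from the quoted passage is easy to see in isolation: once the marker survives OCR, recovering the columns is a plain string split. A tiny illustration (the OCR output string here is invented):

```python
# Splitting a flat OCR string back into cells on the "^" sentinel
# that was drawn onto each cell boundary before recognition.
ocr_output = "Alice ^ 42 ^ London"
cells = [c.strip() for c in ocr_output.split("^")]
# cells == ['Alice', '42', 'London']
```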
If you want to try the Tesseract power, maybe you should try this site:
Which OCR are you talking about?
Will you be developing code based on that OCR, or will you be using something off the shelf?
Tesseract OCR
It implements a document-reading executable, so you can feed a whole page in and it will extract the characters for you. It recognizes blank spaces pretty well, so it might be able to help with tab spacing.
I've been OCR'ing scanned documents since '98. This is a recurring problem for scanned docs, especially for those that include rotated and/or skewed pages.
Yes, there are several good commercial systems, and some, once well configured, can provide a terrific automatic data-mining rate, asking for the operator's help only on the very degraded fields. If I were you, I'd rely on one of them.
If commercial choices threaten your budget, OSS can lend a hand. But there's no free lunch: you'll have to rely on a bunch of tailor-made scripts to scaffold an affordable solution for your pile of docs. Fortunately, you are not alone; over the past decades many people have been dealing with this. So, IMHO, the best and most concise answer to this question is provided by this article:
It is worth reading! The author offers useful tools of his own, but the article's conclusion is very important for getting the right mindset about how to solve this kind of problem.
"There is no silver bullet."
(Fred Brooks, The Mythical Man-Month)
It really depends on implementation.
There are a few parameters that affect the OCR's ability to recognize:
1. How well the OCR is trained - the size and quality of the examples database
2. How well it is trained to detect "garbage" (besides knowing what's a letter, you need to know what is NOT a letter).
3. The OCR's design and type
4. If it's a Neural Network, the network's structure affects its ability to learn and "decide".
So, if you're not making one of your own, it's just a matter of testing different kinds until you find one that fits.
You could try another approach. With Tesseract (or other OCRs) you can get coordinates for each word. Then you can try to group those words by vertical and horizontal coordinates to get rows/columns, for example to tell the difference between a white space and a tab space. It takes some practice to get good results, but it is possible. With this method you can detect tables even if they use invisible separators, i.e. no lines. The word coordinates are a solid base for table recognition.
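A minimal sketch of that grouping idea, using made-up per-word boxes in the shape Tesseract's TSV output provides (word, left, top); the row tolerance value is illustrative:

```python
# Cluster words into table rows by their top coordinate, then order
# each row by the left coordinate to recover the columns.

def to_rows(word_boxes, row_tol=10):
    """word_boxes: (word, left, top) tuples, e.g. from Tesseract TSV."""
    rows = []
    for w in sorted(word_boxes, key=lambda t: t[2]):    # scan top-down
        if rows and abs(w[2] - rows[-1][-1][2]) <= row_tol:
            rows[-1].append(w)                          # same row
        else:
            rows.append([w])                            # new row
    return [[t[0] for t in sorted(r, key=lambda t: t[1])] for r in rows]

boxes = [("Total", 40, 118), ("Item", 40, 20), ("Price", 200, 22),
         ("Apple", 42, 60), ("1.20", 205, 61), ("3.50", 201, 117)]
print(to_rows(boxes))
# [['Item', 'Price'], ['Apple', '1.20'], ['Total', '3.50']]
```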
We have also struggled with the issue of recognizing text within tables. There are two solutions which do it out of the box, ABBYY Recognition Server and ABBYY FlexiCapture. Recognition Server is a server-based, high-volume OCR tool designed for converting large volumes of documents to a searchable format. Although it is available with an API for those kinds of uses, we recommend FlexiCapture. FlexiCapture gives low-level control over extraction of data from table formats, including automatic detection of table items on a page. It is available as a full API version without a front end, or as the off-the-shelf version that we market. Reach out to me if you want to know more.

User2Vec? Representing a user based on the docs they consume

I'd like to form a representation of users based on the last N documents they have liked.
So I'm planning on using doc2vec to form this representation of each document, but I'm trying to figure out a good way to place users in the same space.
Something as simple as averaging the vectors of the last 5 documents they consumed springs to mind, but I'm not sure if that might be a bit silly. Maybe some sort of kNN approach in the space might be possible.
Then I'm wondering: the same way we just use a doc id in doc2vec, how crazy would it be to just add in a user id token and try that way to get a representation of a user, much the same way as for a document.
I've not been able to find much on ways to use word2vec type embeddings to come up with both document vectors and user vectors that can then be used in a sort of vector space model approach.
Anyone any pointers or suggestions?
It's reasonable to try Doc2Vec for analyzing such user-to-document relationships.
You could potentially represent a user by the average of the last N docs consumed, as you suggest. Or all docs they consumed. Or perhaps M centroids chosen to minimize the distances to the last N documents they consumed. Which of these works well for your data/goals can only be found by exploratory experimentation.
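For illustration, the simplest of those options, averaging the last-N doc-vectors, looks like this (toy 3-d vectors stand in for real Doc2Vec output):

```python
# Represent a user as the element-wise mean of the doc-vectors of
# the last N documents they liked (toy vectors, not real Doc2Vec).

def average_vector(vectors):
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

last_liked = [[0.2, 0.8, 0.1],
              [0.4, 0.6, 0.3],
              [0.0, 1.0, 0.2]]

user_vec = average_vector(last_liked)  # roughly [0.2, 0.8, 0.2]
```

In practice the vectors would come from a trained gensim Doc2Vec model, and user_vec would then be compared to doc-vectors with cosine similarity.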
You could try adding user-tags to whatever other doc-ID-tags (or doc-category-tags) are provided during bulk Doc2Vec training. But beware that adding more tags means a larger model, and in some rough sense "dilutes" the meaning that can be extracted from the corpus, or allows for overfitting based on idiosyncrasies of seldom-occurring tags (rather than the desired generalization that's forced when a model is smaller). So if you have lots of user-tags, and perhaps lots of user-tags that are only applied to a small subset of documents, the overall quality of the doc-vectors may suffer.
One other interesting (but expensive-to-calculate) technique in the Word2Vec space is "Word Mover's Distance" (WMD), which compares texts based on the cost to shift all of one text's meaning, represented as a series of piles-of-meaning at the vector positions of each word, to match another's piles. (Shifting words to nearby word-vectors is cheap; to distant ones is expensive. The calculation finds the optimal set of shifts and reports its cost, with lower costs indicating more-similar texts.)
It strikes me that sets-of-doc-vectors could be treated the same way, and so the bag-of-doc-vectors associated with one user need not be reduced to any single average vector, but could instead be compared, via WMD, to another bag-of-doc-vectors, or even single doc-vectors. (There's support for WMD in the wmdistance() method of gensim's KeyedVectors, but not directly on Doc2Vec classes, so you'd need to do some manual object/array juggling or other code customization to adapt it.)

How to preprocess text for embedding?

In the traditional "one-hot" representation of words as vectors you have a vector of the same dimension as the cardinality of your vocabulary. To reduce dimensionality usually stopwords are removed, as well as applying stemming, lemmatizing, etc. to normalize the features you want to perform some NLP task on.
I'm having trouble understanding whether/how to preprocess text to be embedded (e.g. with word2vec). My goal is to use these word embeddings as features for an NN to classify texts into topic A / not topic A, and then perform event extraction on the documents of topic A (using a second NN).
My first instinct is to preprocess by removing stopwords, lemmatizing, stemming, etc. But as I learn a bit more about NNs I realize that, applied to natural language, the CBOW and skip-gram models would in fact require the whole set of words to be present: to be able to predict a word from context one would need to know the actual context, not a reduced form of the context after normalizing... right? The actual sequence of POS tags seems to be key for a human-feeling prediction of words.
I've found some guidance online but I'm still curious to know what the community here thinks:
Are there any recent commonly accepted best practices regarding punctuation, stemming, lemmatizing, stopwords, numbers, lowercase etc?
If so, what are they? Is it better in general to process as little as possible, or more on the heavier side to normalize the text? Is there a trade-off?
My thoughts:
It is better to remove punctuation (but e.g. in Spanish don't remove the accents, because they do convey contextual information), change written numbers to numeric, not lowercase everything (useful for entity extraction), and do no stemming and no lemmatizing.
Does this sound right?
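For what it's worth, that light-touch policy can be sketched as a tiny pipeline; the regex rule and the number-word mapping below are illustrative only, not a recommendation:

```python
# Light preprocessing: strip punctuation but keep accented letters,
# map written numbers to digits, preserve case, no stemming/lemmatizing.
import re

NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}  # toy mapping

def light_preprocess(text):
    # \w matches accented letters in Python 3, so accents survive.
    text = re.sub(r"[^\w\s]", " ", text)
    return [NUMBER_WORDS.get(tok.lower(), tok) for tok in text.split()]

print(light_preprocess("Visité París two times!"))
# ['Visité', 'París', '2', 'times']
```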
So many questions. The answer to all of them is probably "it depends". You need to consider the classes you are trying to predict and the kind of documents you have. Trying to predict authorship (where you definitely need to keep all kinds of punctuation and case so stylometry will work) is not the same as sentiment analysis (where you can get rid of almost everything but have to pay special attention to things like negations).
I've been working on this problem myself for some time. I totally agree with the other answers, that it really depends on your problem and you must match your input to the output that you expect.
I found that for certain tasks like sentiment analysis it's OK to remove lots of nuance by preprocessing, but e.g. for text generation it is quite essential to keep everything.
I'm currently working on generating Latin text and therefore I need to keep quite a lot of structure in the data.
I found a very interesting paper doing some analysis on that topic, but it covers only a small area. However, it might give you some more hints:
On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
by Jose Camacho-Collados and Mohammad Taher Pilehvar
Here is a quote from their conclusion:
"Our evaluation highlights the importance of being consistent in the preprocessing strategy employed across training and evaluation data. In general a simple tokenized corpus works equally or better than more complex preprocessing techniques such as lemmatization or multiword grouping, except for a dataset corresponding to a specialized domain, like health, in which sole tokenization performs poorly. Additionally, word embeddings trained on multiword-grouped corpora perform surprisingly well when applied to simple tokenized datasets."
I would say apply the same preprocessing to both ends. The surface forms are your link so you can't normalise in different ways. I do agree with the point Joseph Valls makes, but my impression is that most embeddings are trained in a generic rather than a specific manner. What I mean is that the Google News embeddings perform quite well on various different tasks and I don't think they had some fancy preprocessing. Getting enough data tends to be more important. All that being said -- it still depends :-)

OCR correction with prior transcription?

I have a range of documents imaged and available in TIFF, JPEG and PDF.
Many have been transcribed and the transcriptions checked for accuracy.
I want to create PDFs, and I wonder if there is a way to OCR the images and correct the result against the verified transcriptions, or to 'insert' the verified transcription during the OCR process?
I have access to Omnipage, Abbyy Finereader and Tesseract but I don't know if what I want to do is at all possible.
Jack. Thanks for the clarification.
In short, the transcribed data has little-to-no benefit to any OCR process you can easily run, with the exception of a highly customized application, developed specifically to do fuzzy per-word lookups from the OCR'd text into specific places in your transcribed data. In that custom application you would use regular OCR (any of the ones you named), but preferably one that provides coordinates for the processed text (such as the OCR-IT API with export to XML), or some kind of SDK that gives you object-based access to the text.
As part of post-processing, your application could then refer back to the transcribed data, assuming you have a way to identify where in the transcribed data you are at any moment, or at least you can perform a full-text search and identify the correct instance when multiple instances are found. Your transcribed data probably does not have coordinates linking the text back to the original images it came from. If similar data is found and there is a character difference, your application could take the transcribed data and replace (i.e. correct) the OCR'd data with it. This will most likely not work for handwritten text, as regular OCR will produce noise from it, not sufficient for even a fuzzy lookup. Once all data replacement has been done, your application will need a PDF export capability, for which again some library could be used.
The whole process is complex, and hit-or-miss in some cases, especially around handwritten text. If you have a huge number of these images plus data, then it may be worthwhile to spend days (if not weeks) developing such a specialized application to crunch all that data. A cost analysis needs to be performed.
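To make the fuzzy per-word lookup concrete, here is a toy sketch using Python's standard-library difflib; the cutoff and sample words are illustrative, and a real system would also need the coordinate bookkeeping described above:

```python
# Correct OCR'd words against a verified transcription by fuzzy
# matching each word to its closest counterpart in the reference.
import difflib

def correct(ocr_words, reference, cutoff=0.75):
    fixed = []
    for word in ocr_words:
        match = difflib.get_close_matches(word, reference, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else word)  # keep word if no match
    return fixed

reference = ["Meeting", "minutes", "approved", "unanimously"]
print(correct(["Meetinq", "rninutes", "approved", "unanirnously"], reference))
# ['Meeting', 'minutes', 'approved', 'unanimously']
```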
Aside from handwriting, modern top-quality OCR (ABBYY, Nuance, OCR-IT) should produce high-quality text if your images are of high quality. With "PDF, text under image" output, any OCR errors will be invisible to readers. I would say an expectation of 95-99% accuracy out of the box is realistic. This out-of-the-box option may give you high enough accuracy with little time or expense.
There is one benefit your transcribed data can provide, especially if it contains specialized or industry-specific words or proper names that may not be found in a common English dictionary (one is already included with ABBYY and other OCR software). By building a custom dictionary out of your transcribed data, ABBYY OCR can use it to further improve recognition of those special words with out-of-the-box processing.
Ilya Evdokimov

Facebook Sentiment Analysis API

I want to try to create an application which rates a user's Facebook posts based on their content (sentiment analysis).
I tried creating an algorithm myself initially, but I felt it wasn't that reliable:
I created a dictionary list of words, scanned the posts against the dictionary, and rated whether each was positive or negative.
However, I feel this is minimal. I would like to rate the mood, feelings or personality traits of the person based on the posts. Is this possible?
I would hope to make use of some online APIs; please assist. Thanks ;)
As #Jared pointed out, a dictionary-based approach can work quite well in some situations, depending on the quality of your training corpus. This is actually how the CLiPS Pattern and TextBlob implementations work.
Here's an example using TextBlob:
from textblob import TextBlob  # current package path; older releases used "text.blob"
b = TextBlob("StackOverflow is very useful")
b.sentiment # returns (polarity, subjectivity)
# (0.39, 0.0)
By default, TextBlob uses pattern's dictionary-based algorithm. However, you can easily swap out algorithms. You can, for example, use a Naive Bayes classifier trained on a movie reviews corpus.
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
b = TextBlob("Today is a good day", analyzer=NaiveBayesAnalyzer())
b.sentiment # returns (label, prob_pos, prob_neg)
# ('pos', 0.7265237431528468, 0.2734762568471531)
The algorithm you describe should actually work well, but the quality of the result depends greatly on the word list used. For Sentimental, we take comments on Facebook posts and score them based on sentiment. Using the AFINN 111 word list to score the comments word by word, this approach is (perhaps surprisingly) effective. By normalizing and stemming the words first, you should be able to do even better.
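The word-list scoring described above fits in a few lines; the mini-dictionary here is a toy stand-in for the real AFINN-111 list (which maps a few thousand words to valences from -5 to +5):

```python
# AFINN-style scoring: sum the valence of each known word in the post.
AFINN = {"good": 3, "useful": 2, "bad": -3, "terrible": -3}  # toy subset

def score(post):
    return sum(AFINN.get(word, 0) for word in post.lower().split())

print(score("StackOverflow is very useful"))  # 2
print(score("what a terrible day"))           # -3
```

Normalizing and stemming before the lookup, as suggested, would simply be extra steps applied to each word before `AFINN.get`.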
There are lots of sentiment analysis APIs that you can easily incorporate into your app, also many have a free usage allowance (usually, 500 requests a day). I started a small project that compares how each API (currently supporting 10 different APIs: AIApplied, Alchemy, Bitext, Chatterbox, Datumbox, Lymbix, Repustate, Semantria, Skyttle, and Viralheat) classifies a given set of texts into positive, negative or neutral:
Each specific API can offer lots of other features, like classifying emotions (delight, anger, sadness, etc) or linking sentiment to entities the sentiment is attributed to. You just need to go through available features and pick the one that suits your needs.
TextBlob is another possibility, though it will only classify texts into pos/neg/neu.
If you are looking for an open-source implementation of a sentiment analysis engine based on a Naive Bayes classifier in C#, take a peek at It works best on a large corpus of words, like blog posts or multi-paragraph product reviews. However, I am not sure whether it would work for Facebook posts that contain only a handful of words.