Category Archives: My Activities

Is this a toxic comment?

False news and toxic comments on the web are no longer merely a nuisance: they can topple governments or act as a catalyst for communal disharmony. This gives the recently concluded toxic comment identification competition on Kaggle an added value.

Introduction

The Kaggle competition was organized by the Conversation AI team as part of its effort to improve online conversations. Their current public models are available through the Perspective API, but the team was looking to explore better solutions with the help of the Kaggle community.

EDA

In terms of size, the dataset is relatively small, with the training set containing 134,384 records and the test set 117,888.

The training and test sets both contain the following fields.

  • ID – a random unique string
  • Comment – the text of the comment

In addition, the training set contains the following six binary label fields. These labels are not mutually exclusive: a comment can be both "Toxic" and "Severe_toxic".

  • Toxic
  • Severe_toxic
  • Obscene
  • Threat
  • Insult
  • Identity_hate

Here is the distribution of classes in the training set.

Distribution of classes

Number of tags applied per comment.

Number of tags applied per comment

Most frequent toxic words per class, taken from jagangupta’s kernel.

Most frequent toxic words

For those interested, jagangupta's brilliant kernel explores the dataset in depth and presents a detailed report.

Text pre-processing

A little text pre-processing helped improve the results somewhat, in the range of ~0.0005 mean ROC AUC (more on evaluation later). Replacing IP addresses with a token, and replacing abbreviations, emojis and expressions such as "gooood" with appropriate/normalized words, helped. Removing stop words seemed to hinder the sequence models rather than help them, which suggests the models were able to make use of at least some of those words. I also kept a few symbols such as "!" and "?", since both fastText and GloVe embeddings have representations for symbols.

import re

def clean(comment):
    """
    This function receives a comment and returns a cleaned comment.
    """
    comment = comment.lower()
    # normalize IP addresses to a single token
    comment = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ip ", comment)
    # collapse repeated letters, e.g. "gooood" -> "good"
    comment = re.sub(r'(\w)\1{2,}', r'\1\1', comment)
    # pad ! and ? with spaces so they are kept as separate tokens by Keras
    comment = re.sub(r'(!|\?)', " \\1 ", comment)

    # split the sentence into words
    words = comment.split(' ')

    # normalize common abbreviations
    # `replacements` is a dictionary loaded from https://drive.google.com/file/d/0B1yuv8YaUVlZZ1RzMFJmc1ZsQmM/view
    words = [replacements[word] if word in replacements else word for word in words]

    clean_sent = " ".join(words)
    return clean_sent

Another interesting point is the maximum number of features used: it stops having any noticeable effect beyond a certain threshold. So, to save computation cost and time, it's worth setting a limit. In the Keras tokenizer, this can be achieved by setting the num_words parameter, which limits the vocabulary to the n most frequent words in the dataset. In this case, I settled on 100,000 as the maximum number of words used for the models.

The same applies to the sequence length used to represent a sentence, which can be selected by looking at the following graph.

Number of words in a sentence. Source: https://www.kaggle.com/sbongo/for-beginners-tackling-toxic-using-keras

Sentences longer than a given threshold need to be truncated, while shorter sentences need to be padded to that length. This is required before feeding the dataset to a sequence model because the model needs a defined number of input units. In all the experiments detailed here, I used a sequence length of 200, since using more features made no noticeable difference.

import numpy as np
import pandas as pd
from keras.preprocessing import text, sequence

max_features = 100000  # maximum vocabulary size (see above)
maxlen = 200           # length of the submitted sequence
embed_size = 300       # dimensionality of the fastText vectors
EMBEDDING_FILE = './data/fasttext/crawl-300d-2M.vec'

train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

x_train = train["comment_text"].fillna("fillna").values
y_train = train[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].values
x_test = test["comment_text"].fillna("fillna").values

# the default filters parameter removes the symbols ! and ?, which we want to keep
tokenizer = text.Tokenizer(num_words=max_features, filters='"#$%&()*+,-./:;<=>@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(list(x_train) + list(x_test))
x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)

# pad/truncate sentences to the maximum length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

# build a mapping from each word to its embedding vector
def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.rstrip().rsplit(' ')) for o in open(EMBEDDING_FILE))

# build the embedding matrix
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.zeros((nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

Models

The evaluation of models was based on the mean column-wise ROC AUC, that is, the ROC AUC score of each label column, averaged.
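
For reference, here is a minimal sketch of that metric, assuming y_true and y_pred are (n_samples, 6) arrays of true labels and predicted probabilities:

import numpy as np
from sklearn.metrics import roc_auc_score

def mean_column_wise_auc(y_true, y_pred):
    # average the per-label ROC AUC scores across the six columns
    return np.mean([roc_auc_score(y_true[:, i], y_pred[:, i])
                    for i in range(y_true.shape[1])])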

Though I tried a number of models and variations, they can be roughly summarized into five models.

  1. Logistic regression model combining unigram, bigram and character-level features with tf-idf weighting
  2. LSTM/GRU based model with GloVe and fastText embeddings
  3. LSTM/GRU + CNN model with GloVe and fastText embeddings
  4. A deep CNN model
  5. Sequence layer with attention

Here is each model in detail.

    1. Logistic regression

The final LR model combined three sets of features: unigram and bigram tf-idf weighted features over the 10,000 most frequent words each, plus character-level n-grams of lengths 2 to 6 (top 50,000 features). With a single cross-validation split of 5%, this model achieved a highest LB score of 0.9805.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from scipy.sparse import hstack
import pandas as pd

class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# train and test are the DataFrames loaded earlier
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])  # vectorizers are fit on the combined text

# unigram feature extractor
unigram_vectorizer = TfidfVectorizer(
    sublinear_tf=True, strip_accents='unicode', analyzer='word', ngram_range=(1, 1),
    use_idf=True, smooth_idf=True, stop_words='english', max_features=10000
)
unigram_vectorizer.fit(all_text)
train_unigram_features = unigram_vectorizer.transform(train_text)
test_unigram_features = unigram_vectorizer.transform(test_text)

# bigram feature extractor
bigram_vectorizer = TfidfVectorizer(
    sublinear_tf=False, strip_accents='unicode', analyzer='word', ngram_range=(2, 2),
    use_idf=True, smooth_idf=True, stop_words='english', max_features=10000
)
bigram_vectorizer.fit(all_text)
train_bigram_features = bigram_vectorizer.transform(train_text)
test_bigram_features = bigram_vectorizer.transform(test_text)

# character n-gram feature extractor (lengths 2 to 6)
char_vectorizer = TfidfVectorizer(
    sublinear_tf=True, strip_accents='unicode', analyzer='char',
    stop_words='english', ngram_range=(2, 6), max_features=50000
)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)

train_features = hstack([train_unigram_features, train_bigram_features, train_char_features])
test_features = hstack([test_unigram_features, test_bigram_features, test_char_features])

# one binary classifier per label; submission is a DataFrame holding the test ids
submission = pd.DataFrame({'id': test['id']})
for class_name in class_names:
    train_target = train[class_name]
    classifier = LogisticRegression(solver='sag')

    classifier.fit(train_features, train_target)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]

    2. A sequence model with bi-directional LSTM/GRU with embeddings

Essentially, these are bi-directional LSTM or GRU models taking embeddings as input. Initially I tried a many-to-one model (returning only the last state) for the sequence layer, but this performed poorly. Influenced by the community, the sequence layer was changed to return the states of all units; these time-distributed signals were captured by two pooling layers (average and max), then concatenated and used as input for the final layer: six densely connected sigmoid units representing the six output labels. This performed much better, easily achieving around 0.9830 on the LB.

I tried a few variations of the same model. Among them, a bi-directional LSTM/GRU layer connected to a time-distributed dense layer (figure a) and a multi-layer bi-directional LSTM/GRU model (figure b) are noteworthy. The multi-layer model seemed to overfit, and at best performed comparably to a single layer when regularized heavily with a high dropout rate. The single bidirectional LSTM layer connected to a time-distributed dense layer with a moderate dropout rate, however, performed a little better than a plain single bidirectional LSTM layer, so this last model was used for further model averaging.
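
For reference, here is a minimal sketch of the multi-layer variant (figure b); the unit counts and dropout rates are illustrative assumptions rather than the exact values used.

from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional, GRU,
                          GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate, Dense)
from keras.models import Model

def build_multilayer_gru():
    inp = Input(shape=(maxlen,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
    x = SpatialDropout1D(0.5)(x)  # heavy regularization to fight overfitting
    x = Bidirectional(GRU(64, return_sequences=True, dropout=0.3))(x)
    x = Bidirectional(GRU(64, return_sequences=True, dropout=0.3))(x)  # second recurrent layer
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    outp = Dense(6, activation="sigmoid")(concatenate([avg_pool, max_pool]))
    return Model(inputs=inp, outputs=outp)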

One observation from this experiment was that it didn't make much difference in accuracy whether LSTM or GRU units were used for the sequence layer. Different embeddings, however, did make a noticeable difference. I tried fastText (crawl, 300d, 2M word vectors) and GloVe (crawl, 300d, 2.2M vocab), and the fastText embeddings worked slightly better in this case (~0.0002-0.0005 in mean AUC). I didn't bother training embeddings from scratch since the dataset didn't look large enough for that. This lecture explains what might happen when trying to train pre-trained embeddings on a small dataset.

For building the models, Keras was used with the default TensorFlow backend. As for hardware, AWS P2 instances (Tesla K80 GPUs) were used.

a) LSTM-Dense-Pooling

b) Multi-layer Bi-GRU

from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional, LSTM,
                          TimeDistributed, Dense, Dropout, GlobalAveragePooling1D,
                          GlobalMaxPooling1D, concatenate)
from keras.models import Model
from keras.optimizers import Nadam

def build_model(): # figure (a) as a Keras model
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
    x = SpatialDropout1D(0.4)(x)
    x = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.0, dropout=0.2))(x)
    x = TimeDistributed(Dense(100, activation="relu"))(x) # time-distributed dense layer (relu)
    x = Dropout(0.1)(x)

    # global pooling layers
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])
    outp = Dense(6, activation="sigmoid")(conc)

    model = Model(inputs=inp, outputs=outp)

    return model

model = build_model()
opt = Nadam(lr=0.001)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
model.summary()
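
Training followed the usual Keras pattern. The following is a sketch of a possible training call; the batch size, epoch count, early stopping and the validation split are illustrative assumptions rather than the exact settings used.

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2)
model.fit(x_train, y_train,
          batch_size=256,
          epochs=10,
          validation_split=0.05,
          callbacks=[early_stop])
pred = model.predict(x_test, batch_size=1024)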

    3. LSTM/GRU + CNN model with GloVe and fastText embeddings

This model is an attempt at combining sequence models with convolutional neural networks (CNNs). At first I tried a convolution layer passing its signals to the sequence layer, but it seems that swapping these layers (embeddings feeding into the LSTM first, then a CNN over each LSTM unit's state) and pooling into an output layer brings better results. This study and kernel explain the model better.

CNN-LSTM-pooling model

from keras.layers import Conv1D

def build_model(): # the CNN-LSTM-pooling model above as a Keras model
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
    x = SpatialDropout1D(0.4)(x)
    x = Bidirectional(LSTM(80, return_sequences=True, recurrent_dropout=0.2, dropout=0.2))(x)
    x = Conv1D(filters=64, kernel_size=2, padding='valid', kernel_initializer="he_uniform")(x)
    x = Dropout(0.2)(x)

    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])

    output = Dense(64, activation="relu")(conc)
    output = Dropout(0.1)(output)
    output = Dense(6, activation="sigmoid")(output)

    model = Model(inputs=inp, outputs=output)
    return model

As a variation, I tried to emulate bigrams and trigrams by using kernel sizes of 2 and 3 in the CNN layer and concatenating their outputs through pooling layers, but again, this seemed to overfit.
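
A minimal sketch of that variation might look like the following; the filter counts and dropout rates are assumptions.

from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional, LSTM,
                          Conv1D, GlobalAveragePooling1D, GlobalMaxPooling1D,
                          concatenate, Dense, Dropout)
from keras.models import Model

def build_multi_kernel_model():
    inp = Input(shape=(maxlen,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
    x = SpatialDropout1D(0.4)(x)
    x = Bidirectional(LSTM(80, return_sequences=True))(x)

    pooled = []
    for kernel_size in (2, 3):  # emulate bigrams and trigrams
        conv = Conv1D(filters=64, kernel_size=kernel_size, padding='valid',
                      kernel_initializer='he_uniform')(x)
        pooled.append(GlobalAveragePooling1D()(conv))
        pooled.append(GlobalMaxPooling1D()(conv))
    conc = concatenate(pooled)

    out = Dropout(0.1)(Dense(64, activation='relu')(conc))
    out = Dense(6, activation='sigmoid')(out)
    return Model(inputs=inp, outputs=out)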

    4. A deep CNN model

This model is based on this paper and this kernel. In summary, it consists of multiple convolution and pooling layers with skip connections. Unfortunately, it didn't perform up to the mark of the two models above, with both the fastText and GloVe versions scoring 0.9834 and their average reaching 0.9843 on the LB. This could be due either to my not having had time to tune it or to overfitting again. Still, the approach could work with more data, and indeed it does, according to the stats reported in the paper.

    5. Sequence layer with attention

This was a half-hearted attempt, but I still wanted to see how well attention works. I used an attention layer and a secondary LSTM layer. Strangely, it couldn't surpass the first two models, which have no attention. It probably could have with more time spent on tuning, but the training time needed was higher than for the other models.
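
For completeness, here is a minimal sketch of soft attention over the recurrent states (omitting the secondary LSTM layer mentioned above); the layer sizes are assumptions.

from keras import backend as K
from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional, LSTM,
                          TimeDistributed, Dense, Flatten, Activation, RepeatVector,
                          Permute, multiply, Lambda)
from keras.models import Model

def build_attention_model():
    inp = Input(shape=(maxlen,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
    x = SpatialDropout1D(0.3)(x)
    x = Bidirectional(LSTM(64, return_sequences=True))(x)    # (batch, maxlen, 128)

    # soft attention over timesteps: score -> softmax -> weighted sum of states
    scores = TimeDistributed(Dense(1, activation='tanh'))(x) # (batch, maxlen, 1)
    scores = Flatten()(scores)                               # (batch, maxlen)
    weights = Activation('softmax')(scores)                  # (batch, maxlen)
    weights = RepeatVector(128)(weights)                     # (batch, 128, maxlen)
    weights = Permute([2, 1])(weights)                       # (batch, maxlen, 128)
    context = multiply([x, weights])
    context = Lambda(lambda t: K.sum(t, axis=1))(context)    # (batch, 128)

    outp = Dense(6, activation="sigmoid")(context)
    return Model(inputs=inp, outputs=outp)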

Attention with LSTM (Source: https://www.coursera.org/learn/nlp-sequence-models/)

Model averaging and ensemble

Due to the relatively small dataset, there seemed to be a degree of overfitting, so it's not surprising that model averaging and regularization had a strong positive effect on prediction accuracy. Averaging was done in various forms: one was stratified 10-fold training, which improved performance noticeably, though it obviously took more time. When averaging folds, a weighted average or a rank average performed slightly better than taking the plain mean of the predictions.
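
As an illustration, here is a minimal sketch of rank averaging, assuming fold_preds is a list of (n_test, 6) prediction arrays, one per fold. Since ROC AUC depends only on the ordering of the predictions, averaging ranks instead of raw probabilities is a valid alternative.

import numpy as np
from scipy.stats import rankdata

def rank_average(fold_preds):
    averaged = np.zeros_like(fold_preds[0], dtype=np.float64)
    for p in fold_preds:
        for col in range(p.shape[1]):
            # rank the predictions within each label column and rescale to (0, 1]
            averaged[:, col] += rankdata(p[:, col]) / p.shape[0]
    return averaged / len(fold_preds)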

Another interesting tidbit is how to use stratified k-fold with multi-label classification, since the popular stratified k-fold function in scikit-learn only supports single-label splits. Here, numpy.packbits can come in handy: packing the six binary labels into a single byte gives each label combination its own stratum.

import numpy as np
from sklearn.model_selection import StratifiedKFold

n_folds = 10
pred = np.zeros((x_test.shape[0], 6))
# pack the six binary labels into one integer so each label combination forms a stratum
y_packed = np.packbits(y_train.astype(np.uint8), axis=1).ravel()

kfold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=32)

for i, (train_idx, valid_idx) in enumerate(kfold.split(x_train, y_packed)):
    print("Running fold {} / {}".format(i + 1, n_folds))
    print("Training / Valid set counts {} / {}".format(train_idx.shape, valid_idx.shape))

    # train the model

Model averaging was also done across different embeddings: the predictions produced by the same model with GloVe and with fastText embeddings were averaged. This step improved the final accuracy significantly: the best single LSTM-CNN model achieved 0.9856 on the LB after averaging predictions from the two embeddings, whereas GloVe and fastText alone got 0.9850 and 0.9854 respectively with the same model.

Towards the end, the competition turned into ensemble madness. I would have liked to try some stacking, which in my opinion is a better way of combining models than randomly conjured-up coefficients.

Improvements and notes from the community

In my opinion, the best feature of Kaggle competitions is the collaborative learning experience. Here are some of the effective techniques that have been used by other teams.

    1. Augmenting the train/test dataset with translation

Using translations is an interesting way of augmenting the dataset, and it worked wonderfully without information leakage. The technique is quite simple: translate sentences into a few nearby languages, and then translate them back to English. When you think about it, this makes sense, since it can help reduce overfitting by adding more data with variations in sentence structure. More details can be found in this kernel.
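
A sketch of the idea is below; translate is a hypothetical helper standing in for whatever translation service is used, not a real API.

def back_translate(comment, pivot_langs=("fr", "de", "es")):
    """Return augmented copies of a comment via round-trip translation."""
    augmented = []
    for lang in pivot_langs:
        pivot = translate(comment, src="en", dst=lang)          # English -> pivot language
        augmented.append(translate(pivot, src=lang, dst="en"))  # pivot -> back to English
    return augmented  # each augmented copy keeps the labels of the original comment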

    2. Adding more embeddings

It seems much of the complexity in this case comes from the embedding layer, and using more embeddings helps more than using different architectures. So using all variations of GloVe, Word2vec, LexVec and fastText (e.g., Crawl, Twitter, Wikipedia) and averaging the resulting predictions can help. More on the various pre-trained embeddings can be found here and here.

    3. Byte-pair encoding (BPE)

Usually in any text-processing work there is a large number of out-of-vocabulary words, that is, words that are not found in the embeddings. The standard way of handling such words is to map them to a token such as <unk> and move on. Byte-pair encoding has shown better results in handling these kinds of words by breaking a word into subwords (similar to phonemes in speech recognition) and using the embeddings of those subwords to arrive at the full word. More on this can be found here.
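
The snippet below is not real BPE (which learns merge rules from corpus statistics, and is properly implemented by libraries such as sentencepiece); it is only a crude illustration of the subword idea: greedily split an out-of-vocabulary word into the longest pieces that do exist in the embedding vocabulary and average their vectors.

import numpy as np

def subword_vector(word, embeddings_index, embed_size=300):
    if word in embeddings_index:
        return embeddings_index[word]
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):  # try the longest piece first
            piece = word[start:end]
            if piece in embeddings_index:
                pieces.append(embeddings_index[piece])
                start = end
                break
        else:
            start += 1                           # no piece matched, skip one character
    return np.mean(pieces, axis=0) if pieces else np.zeros(embed_size)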

    4. Capsule networks

A few teams tried Geoffrey Hinton's capsule networks, and they were reported to be overfitting as well. Anyway, it's something worth trying out.



The necessity of lifelong learning

Live as if you were to die tomorrow. Learn as if you were to live forever.
― Mahatma Gandhi

The term "lifelong learning" sounds nonsensical when you consider that learning from experience is an intrinsic function built into all humans and animals. But today, in the context of rapid advances in AI and automation, the term carries a different meaning. This is an attempt at discussing why lifelong learning is increasingly needed today, and at encouraging everyone to take up active learning and expand their horizons if they haven't already.

The pace of technological advancement

The consensus is that what you learn today will be out of date within 5-10 years. By that argument alone, it's a no-brainer that we should keep learning. The pace of advancement is almost tangible in technical fields, and not taking time to update yourself would be a critical career mistake. Since my experience is in computer science, this post will focus mostly on CS, but I believe the argument holds true for most other areas as well.

I doubt there's any other field advancing as fast as CS at the moment (definitely subjective :)). Most of us working in the field acknowledge this fact and accept the challenge, and even call it an endearing quality. At any rate, a change of tools is expected every 5-10 years in CS, so this shouldn't be anything new. However, just changing tools will not be enough if you want to get into emerging CS domains such as the Internet of Things (IoT), Software-Defined Networking (SDN), deep learning, etc., which generally have strong theoretical foundations. Here, online courses can help in two ways.

1. You will probably need more maths and/or computer science fundamentals such as operating systems, networks, algorithms, etc. This is where MOOCs, and especially Khan Academy, can be of great help: they can help us revise old maths lectures and fundamentals.

2. Once in a while there are wonderful offerings on emerging topics by pioneering researchers, and these courses can bring you much closer to the "edge" than what you would normally find in a regular class.

Automation and consequences

Marc Andreessen famously wrote some time ago that software is eating the world; now it's probably time to say, more specifically, that artificial intelligence is eating the world, or at least that it's going to. With ever-increasing computational power and the lifelong efforts of some great scientists, today we are seeing very exciting advances happening on a weekly basis. Even though it took self-driving cars and Watson to bring AI to the mainstream, AI has been around almost as long as the computer itself. Since the coining of the term in 1956, it has gone through various stages of evolution: from the golden era of logic-based reasoning, to the perceptron and the subsequent AI winter, through to the advent of neural networks and the current deep learning frenzy, AI has indeed come a long way.

There's no question that this wave of AI and automation is going to affect the way we work. The question is how much it's going to change, and whether we really need to worry. After all, during the last century the world saw some major revolutions in the way humans work, so why should this be any different? With every major disruptive innovation, there has been both the expiration of traditional jobs and the creation of new ones.

One main difference I see with AI-based automation is that it's not trying to emulate a single function, as has traditionally been the case. For example, the moves from horse-drawn carriages to automobiles, or from paper to digital media, revolutionized human civilization as we know it, but in each of these cases the change was limited to one specific area. What's happening today with AI is an attempt to emulate skills that have been intrinsically marked as human territory, and to do so to the degree of human precision: cognition and decision making key among them. With such faculties being outsourced to machines, there's no telling how widespread the effect will be.

While machine learning researchers caution the world to brace for mass outbreaks of unemployment, some opine that the effect will be similar to the disruptions that happened in the past. While I lean towards the former school of thought, I doubt anyone has a good estimate. This is probably why the White House policy paper on AI discusses both overestimated and underestimated influences. Indeed, some effects are quite unexpected. But looking at how things are going, we can already see that some industries, like transportation, are due for a rude disruption. Here is another estimate of which types of jobs are more prone to being overtaken. It can be expected that single-skill jobs will continue to decay, while jobs that require social or maths skills will remain largely unaffected or see even more demand.

In summary, I think we can all agree that this wave of AI is going to affect how we work, and as the wise say, it's better to be safe than sorry. If you still think this lies in the far future, it's time to think again.

Technology domain is interconnected

Again, this is mostly with regard to computer science, but it may hold true in other fields as well. Today, to get meaningful work done, you usually need to tread across at least a few disciplines. If you are a software engineer, it's not enough to know the fundamentals and a few languages; depending on your flavour, it may be systems, embedded systems and the like, or distributed systems, web security, big data and their ilk. If you are into data science (a cross-discipline to begin with), there's no escaping learning everything from statistics to CS and everything in between! Each of these fields is vast on its own and advances rapidly, just like most areas in CS. In that sense, the words "Try to learn something about everything and everything about something" are more apt today than at any other time.

With such a large scope to draw from and a rapidly advancing industry, I doubt any traditional college can satisfy the need, no matter how good the degree program is. Fortunately, today we don't have to look beyond our browser to learn whatever topic we need to learn; the only question is whether we are ready to expand our horizons.

A modicum of balance to a knowledge driven world

With the ever-persistent brain drain from developing countries and today's demand for knowledge-driven industries, most countries are at a severe disadvantage. With the imminent wave of automation, such an overwhelmingly lopsided world doesn't look promising to begin with. Luckily, some very wise people, who also happen to be leading machine learning researchers, kicked off the drive for today's online learning initiatives in parallel with the rise of AI (this is in no way discounting the wonderful service rendered by MIT OpenCourseWare prior to the arrival of MOOCs). So it's not an exaggeration to call such learning initiatives great equalizers in education and a step towards improving the world's future living standards. As with everything today, some of them are increasingly becoming money-driven, but they have still started something that could change the world for the better.

What should we learn

A little humble bragging: I was an early adopter of MOOCs (as they were later coined) in 2011 and finished both Prof. Andrew Ng's first online machine learning course, which went on to become Coursera, and the first Intro to Artificial Intelligence course by Prof. Sebastian Thrun and Peter Norvig, which was the start of Udacity. From then to date I have taken part in many courses but, as is the norm with MOOCs, have in truth finished only a dozen or so. Anyway, I'd say I have about as good a rapport with MOOCs as you can get, and would like to share a few tips based solely on my subjective experience.

When it comes to learning, you can spend time on lots of very similar things and gain very little in return. In that sense, the classic "Teach Yourself Programming in Ten Years" by Peter Norvig is something everyone should read when deciding what to learn.

Another lesson I learnt is that even though courses are free and limitless, your time is not. So even when a course is really interesting, I now take time to carefully decide whether it will help me expand my knowledge in something I really need. Also, rather than trying to keep up with a bunch of courses at once and not getting anything fully done, it is far better to restrict yourself to a few, depending on your schedule, and fully concentrate on them. Again, this is a no-brainer, but our impulse is to grab everything that is free.

Another recent development is that all the online services are introducing specializations and mini-degree programs. I have doubts whether this is the best way to go from a learner's point of view. One of the advantages of online learning is that you are not restricted by any institutional rules in selecting what to learn or from where. But with these mini-degree programs, we are again bringing traditional restrictions into learning. Instead, I'd prefer to select my own meal and, if the courses are really good, pay for them, or audit them until I'm convinced. But again, this is very much subjective.

In conclusion, learning is an intrinsic function built into everyone. But in this new order of the world, learning has moved into the fast lane, and if we don't catch up to its speed, the world may move forward, leaving us stranded.

My best 7 free JS modal boxes

I have been working on a new iteration of HOT (follow @hotelotravel for more info) for the last few weeks and thought of swapping the existing jQuery UI dialog box for something a bit fancier and more solid (on the other hand, I may have just wanted a break from the usual PHP stuff and to play with jQuery a bit after some time). I had some popular JS modal box names such as Lightbox, Facebox and Thickbox that I wanted to test, and found a few new names along the way. Certainly there will be many more modal boxes out there that I've missed, and of course my requirements will differ from yours, but here is the gist of each modal box I tested, in my opinion, so it may help someone pick the right one at the right moment.

JQuery UI Dialog

jQuery UI in Hotelotravel

This is the dialog box I've used in most cases, and of course it's great. A few years back, when I started working with it, I noticed a few issues when closing the dialog and such, but by now they have all been fixed. It's also continuously maintained by the jQuery community, so you can be sure it's solid. What I like most about it is its simplicity as well as its customization power: through various callbacks and option settings you can define what to do when a drag starts, a drag stops, the box shows up, the box closes, and so on, whenever you need more control.

This is all you have to do to get a simple dialog box if you have a div with the id of “myDialog”.

$("div#myDialog").dialog();

Another perk of this modal box is its file size, which is quite small (about 10KB, or about 6KB minimized). When building a complex site with numerous CSS and JS files, the size of each file becomes crucial for keeping the page load small, reducing load time and saving server bandwidth. The only apparent downside of this box is that it doesn't come with any fancy preloaded extras (themes, effects, preload images, etc.), but anyone interested, with a bit of JS and CSS knowledge, can customize it.

LightBox

Lightbox

This one is focused mainly on presenting pictures and does a good job at it. If you are interested in creating something like a picture gallery without touching much JS, this could be the ideal jQuery plugin for you. It also has a relatively small size, at around 19KB (the packed version is about 6KB). But since I was looking for something with more raw customization power, it wasn't the choice for me.

FaceBox

Facebox

Another popular choice for a modal box, and it deserves the name. It has a very small file size and simple code. It also comes with a default theme and can be a convenient choice for hasty tasks or people 😛 But the downside I noticed is that it gives the user very little customization power through jQuery code (which is also the case with Lightbox, by the way).

It's simple to a fault: you just set the class name of the link that should pop the box to "facebox". It could have had a few more options, such as defining the base path of the project; I couldn't find a way to set it for pre-loading images without hacking the code, nor a way to set the width and height manually or to pass callbacks. Also, this project seems to have been abandoned for a while now, so if you are considering adopting this box for your whole site, check it thoroughly.

ThickBox

ThickBox 3.1

ThickBox is a really cool JS box. Its simplicity and extremely small file size make it very adorable. This modal box gives some customization power but still focuses mainly on simplicity, with the link class name "thickbox" as its magic word. However, as mentioned earlier, it offers better customization through JS than FaceBox or LightBox, so it's more flexible. With some hacking you can also add your own callbacks and options as you like.

LightWindow

LightWindow

This is an all-purpose, very fancy-looking modal box built on the script.aculo.us and Prototype libraries. It can host all kinds of media types, even Flash clips, which is really impressive. But the obvious downside is its huge file size of about 60KB, which is a lot considering it would be just a small part of Hotelotravel.com; for general use that's not tolerable, IMHO. It is not compressed or packed by default, so you can minimize it manually using something like the YUI Compressor to bring it down to around 30KB, but even then it's too large for me. It is, however, ideal for a site that has tons of Ajax and needs a very customizable modal box with lots of options and callbacks that can be used throughout the site.

SimpleModal

SimpleModal

This is the smallest jQuery modal box I've seen, and if you need to customize your modal box to the extreme, I'd suggest this one. The downside is that you will have to write a lot to get anything done with it, but its extremely small size makes up for that.

NyroModal

NyroModal

This is also a very good-looking jQuery-based modal box with a reasonable file size (about 14KB packed), and it gives the user considerable customization power. It also comes with a packaged theme and the like, so you can use it easily without much coding, which is a plus. In fact, I had some trouble deciding whether to use this one or jQuery UI for my task, but finally settled on jQuery UI because of my familiarity with it and its community backing. So I guess in my case jQuery UI is the rightful winner 🙂


PHP + Large files

I was working on the Hotelotravel project for the last few months, and as usual it involved working with large database files in many cases, because when you consider all the hotels, locations and images all over the world, it adds up to a lot. But if you want to do large file uploads or database updates with PHP, a few default settings need to be changed, and I'm putting this down as a note to myself (I always keep forgetting it) as well as for anyone who may find it useful, for example when importing a large backup file through phpMyAdmin.

In your php.ini, check for these settings and change them as you need.

  • post_max_size (The maximum size of post data you can send in one submission)
  • upload_max_filesize (Maximum size of file that can be uploaded)
  • memory_limit (Maximum memory limit that can be allocated for a script execution)
  • max_execution_time (Maximum time limit for a script execution)

As a side note, if you are trying to import large files (backups, etc.) through phpMyAdmin and it refuses, you may need to edit the config.inc.php file and change these settings to 0, which means no limit.

  • $cfg['ExecTimeLimit']
  • $cfg['MemoryLimit']

As a final note, these settings are there for a purpose. So my advice is to change them in whatever manner you want in a development environment, but be very careful when setting them in a production environment, because an endless script execution can waste server bandwidth and even crash your servers. So I guess this is my disclaimer 😉


Closing Tributes

It's a bit late for a closing tribute on SoC, but something is better than nothing, right? So here we go.

Google Summer of Code 2008 was the second SoC I participated in; last year's, with GNOME, was my first. This time SoC was with Eclipse, and it was, in one word, 'Awesome'.

In more detail, the summer with the Eclipse community was a great experience, especially thanks to my mentor David Carver, who supported me through every sticky situation in the process; believe me when I say that without his support it would have been a nightmare (one bad spot with Eclipse is its lack of up-to-date documentation in certain areas). So my heartfelt thanks go to David and all of the Eclipse community who helped me have an easy learning curve.

You can check out all the 2008 SoC Eclipse projects here. More info on the project I was working on, the Eclipse XQuery editor, can be found here. The code can be checked out from the Eclipse incubator repo here: eclipse-incub.cvs.sourceforge.net/cvsroot/eclipse-incub/org.eclipse.wst.xquery. We have also set up a download site for the plugin. If you are interested, please get a copy, check it out, and let me know what you think of it.

Lastly, I'm hoping to continue the project and already have a long todo list prepared. My thanks also go to everyone who voted me in as an official committer for the project. The paperwork is currently being processed, and hopefully I will soon get CVS access to the Eclipse repo as an official committer.

By the way, I got my SoC shirt and certificate last week, and they will be great mementos of this summer with Eclipse. For all this great stuff, a big thank you goes to Google and especially its Open Source division.

GSoC 2008 mementos
